There’s no way other than comparing the data itself. I mean, you can always generate a SHA-256 hash of one file and then compare that to the hash of the other, if you don’t want to stream them both in simultaneously.
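A minimal sketch of that hash-then-compare idea, assuming both files are reachable as local paths (the function names here are made up for illustration). Each file is read in fixed-size chunks, so memory use stays small regardless of file size:

```python
import hashlib


def sha256_of_stream(stream, chunk_size=8192):
    """Hash a binary file-like object without loading it all into memory."""
    digest = hashlib.sha256()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # empty read means end of stream
            break
        digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """Compare two files by digest; the files are read one after the other."""
    with open(path_a, "rb") as a:
        hash_a = sha256_of_stream(a)
    with open(path_b, "rb") as b:
        hash_b = sha256_of_stream(b)
    return hash_a == hash_b
```

The trade-off is that both files are always read to the end, whereas a direct byte-by-byte comparison can stop at the first mismatch.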
Ok, I see what you mean. If there’s enough info in the multihash prefix (like ‘Qm’) to know all the parameters required to set things up, so that you can generate that hash using just CID_a’s data, which you already have, then your idea is good. However, even that “generate a CID to check” step could be inefficient, because afaik you’ll have to write to your store rather than just read from it.
…unless there’s a way to just “generate” a hash from scratch from some data without writing the data. Tons of people have asked for that hash-calculation function, with varying degrees of success, but I still like your idea/thinking.
In other words, just reading CID_b and checking it byte by byte may be the faster algorithm, because generating a CID for any file means you’ll have to read the whole file at least.
And this might be common enough to actually BE a core feature in IPFS itself too. A function to just compare two CIDs.
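On the “hash without writing” question above: the ipfs CLI has an --only-hash (-n) flag on ipfs add that chunks and hashes without storing the blocks. Here is a rough sketch of wrapping it from Python. The function names are hypothetical, and it assumes the ipfs binary is on PATH with an initialised repo. The resulting CID only matches CID_a if the same chunker, CID version and hash settings are used as when CID_a was created.

```python
import shutil
import subprocess


def only_hash_argv(path, extra_args=()):
    """Build the command line; --quieter prints only the final (root) hash."""
    return ["ipfs", "add", "--only-hash", "--quieter", *extra_args, path]


def cid_without_storing(path, extra_args=()):
    """Ask ipfs for the CID of a local file without adding it to the store."""
    if shutil.which("ipfs") is None:
        raise RuntimeError("ipfs binary not found on PATH")
    result = subprocess.run(
        only_hash_argv(path, extra_args),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()
```

For example, cid_without_storing("file.bin", ("--cid-version", "1")) would request a CIDv1. This still reads the whole local file, so it does not beat a plain byte-by-byte comparison on I/O.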
You can use cid.ipfs.io to “compare”. The main thing for telling whether two CIDs correspond to the same data is the multihash; everything else is metadata on top.
The ipfs cid subcommands also allow converting between bases etc.
Now, if the multihash is of a different type (e.g. a different SHA variant), you will have no other option than to re-add the original file using the right multihash type.
Also, files can be added in different ways, with different chunk sizes etc., so it may well be that both hashes point to the same file but are different, and you can’t find a way to reproduce how the hashes were generated.
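To make the “compare the multihash” step concrete, here is a small sketch that pulls the multihash out of a CID string using only the standard library. It is an illustration, not a full CID parser: it assumes the CID is either a CIDv0 (46 characters, base58btc, starting with Qm) or a CIDv1 in base32 (leading b), and the function names are invented for this example.

```python
import base64

B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def b58decode(text):
    """Decode base58btc; leading '1' characters map to leading zero bytes."""
    number = 0
    for char in text:
        number = number * 58 + B58_ALPHABET.index(char)
    body = number.to_bytes((number.bit_length() + 7) // 8, "big")
    zeros = len(text) - len(text.lstrip("1"))
    return b"\x00" * zeros + body


def read_varint(buf, pos):
    """Read an unsigned LEB128 varint; return (value, next position)."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:
            return value, pos
        shift += 7


def multihash_from_cid(cid):
    """Return the raw multihash bytes of a CIDv0 or base32 CIDv1 string."""
    if len(cid) == 46 and cid.startswith("Qm"):
        return b58decode(cid)  # a CIDv0 is just a bare sha2-256 multihash
    if cid.startswith("b"):  # multibase prefix for lowercase base32
        body = cid[1:]
        raw = base64.b32decode(body.upper() + "=" * (-len(body) % 8))
        version, pos = read_varint(raw, 0)
        if version != 1:
            raise ValueError("expected CID version 1")
        _codec, pos = read_varint(raw, pos)  # content type; skipped here
        return raw[pos:]
    raise ValueError("unsupported CID encoding for this sketch")


def same_multihash(cid_a, cid_b):
    return multihash_from_cid(cid_a) == multihash_from_cid(cid_b)
```

Equal multihashes mean both CIDs name the same root block bytes. Unequal multihashes do not prove the underlying files differ.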
This is because the hashes do in fact not correspond to the file, they correspond to the DAG that represents the file in IPFS. The final option then is just downloading both hashes and comparing the files directly.
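A toy illustration of why the chunking matters. This is NOT the real UnixFS/dag-pb layout, just a simplified two-level hash tree with an invented function name: the root hash covers the hashes of the chunks, so the same bytes split at different chunk sizes give different roots.

```python
import hashlib


def toy_root_hash(data, chunk_size):
    """Hash each chunk, then hash the concatenated chunk hashes (toy DAG root)."""
    leaves = [
        hashlib.sha256(data[i:i + chunk_size]).digest()
        for i in range(0, len(data), chunk_size)
    ]
    return hashlib.sha256(b"".join(leaves)).hexdigest()


data = bytes(range(256)) * 4          # 1024 bytes of sample content
root_small = toy_root_hash(data, 256)  # four leaves
root_large = toy_root_hash(data, 512)  # two leaves
print(root_small == root_large)        # same file bytes, different tree
```

The real IPFS importer differs in the details, but the same effect is why two adds of one file can yield two unrelated-looking CIDs.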
The interesting and useful question here is whether there’s a guaranteed way to do this without ever writing data to disk. That is: can I detect whether two CIDs point to identical data without having to, even temporarily, actually STORE the data of one or both files in my IPFS cache?
For example, if I want to compare two URLs on the network (non-IPFS ones), I can just stream them both in byte by byte and compare without ever consuming more than, say, 8192 bytes of memory, and without writing even a single byte to disk.
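That streaming comparison can be sketched roughly like this for any two binary file-like objects (for URLs, the objects returned by urllib.request.urlopen would work; the function names are made up here). It holds one buffer per stream, writes nothing to disk, and stops at the first difference:

```python
def _read_full(stream, size):
    """Read up to `size` bytes, looping because network reads may come up short."""
    buf = bytearray()
    while len(buf) < size:
        chunk = stream.read(size - len(buf))
        if not chunk:  # end of stream
            break
        buf += chunk
    return bytes(buf)


def streams_equal(stream_a, stream_b, chunk_size=8192):
    """Return True if both streams yield exactly the same bytes."""
    while True:
        block_a = _read_full(stream_a, chunk_size)
        block_b = _read_full(stream_b, chunk_size)
        if block_a != block_b:
            return False  # differing content or differing length
        if not block_a:
            return True   # both streams ended at the same point


# Usage with two URLs (placeholders, not real endpoints):
# from urllib.request import urlopen
# with urlopen(url_a) as a, urlopen(url_b) as b:
#     print(streams_equal(a, b))
```

Whether the same trick avoids the local datastore for two CIDs depends on how the data is fetched, which is exactly the open question here.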
If you’re really lazy like me, you could just create a tmpfs, init an ipfs repo on it and use it as you normally would, and nothing should ever hit disk (assuming you don’t have so much memory pressure that it gets written to swap).
Thanks. I had always assumed that it might just be writing it to disk (or whatever datastore it was using) and did the equivalent of a gc just when it was done.