There's no way other than comparing the data itself. I mean, you can always generate a SHA-256 hash of one file and then compare that to the other, if you don't even want to be streaming them both in simultaneously.
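The hash-then-compare idea above could look roughly like this. It is only a sketch: `sha256_of_file` and `same_content` are names made up for illustration, and it hashes the raw file bytes, so the digest is not expected to match anything embedded in a CID.

```python
import hashlib


def sha256_of_file(path, chunk_size=8192):
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """Compare two files by digest; only one file is open at any time."""
    return sha256_of_file(path_a) == sha256_of_file(path_b)
```

Memory use stays around `chunk_size` bytes per file, and the first digest can be computed once and kept around for later comparisons.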
Ok, I see what you mean. If there's enough info in the multihash prefix (like "Qm") to know all the parameters needed to regenerate that hash from just CID_a's data, which you already have, then your idea is good. However, even "generating" a CID to check could be inefficient, because afaik you'll have to write to your store rather than just read from it.
…unless there's a way to just "generate" a hash from scratch from some data without writing the data. Tons of people have asked for that hash-calculation function, with varying degrees of success, but still, I like your idea/thinking.
In other words, just reading CID_b and checking it byte by byte may be the faster algorithm, because generating a CID for any file requires reading that file at least once anyway.
And this might be common enough to actually BE a core feature in IPFS itself too. A function to just compare two CIDs.
You can use cid.ipfs.io to "compare". The main thing to check when deciding whether two CIDs correspond to the same data is the multihash; everything else is metadata on top.
The ipfs cid subcommands also allow converting between bases, etc.
Now, if the multihashes are of different types (i.e. different hash functions), you have no option other than re-adding the original file using the right multihash type.
Also, files can be added in different ways, with different chunk sizes etc., so it may well be that both hashes point to the same file but are different, and you may not find a way to reproduce how the hashes were generated.
This is because the hashes do in fact not correspond to the file, they correspond to the DAG that represents the file in IPFS. The final option then is just downloading both hashes and comparing the files directly.
The interesting and useful question here is whether there's a guaranteed way to do this without ever writing data to disk. That is: can I detect whether two CIDs point to identical data without having to, even temporarily, actually STORE one or both of the files' data in my IPFS cache?
For example, if I want to compare two URLs on the network (non-IPFS ones), I can just stream them both in byte by byte and compare, without ever consuming more than, say, 8192 bytes of memory, and without writing even a single byte to disk.
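Something like the following is what I have in mind for the streaming comparison. It is a sketch over any two binary file-like objects; `streams_equal` is a made-up name, and the 8192-byte chunk size is just the figure mentioned above.

```python
def _read_full(stream, size):
    """Read exactly `size` bytes unless EOF comes first (handles short reads)."""
    parts = []
    remaining = size
    while remaining > 0:
        part = stream.read(remaining)
        if not part:
            break
        parts.append(part)
        remaining -= len(part)
    return b"".join(parts)


def streams_equal(stream_a, stream_b, chunk_size=8192):
    """Compare two binary streams chunk by chunk without buffering them whole."""
    while True:
        a = _read_full(stream_a, chunk_size)
        b = _read_full(stream_b, chunk_size)
        if a != b:
            return False  # differing bytes, or one stream ended early
        if not a:
            return True   # both streams hit EOF at the same point
```

The two arguments could be, for instance, the response objects returned by `urllib.request.urlopen` for the two URLs. Only one chunk per stream is held in memory at a time, and the comparison stops at the first mismatch.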
If you're really lazy like me, you could just create a tmpfs, init an ipfs repo on it, and use it as you normally would; nothing should ever hit disk (assuming you don't have so much memory pressure that you write to swap).
Thanks. I had always assumed that it might just be writing it to disk (or whatever datastore it was using) and did the equivalent of a gc just when it was done.