Comparing two different CIDs derived from the same file

Hi all,

For any given file, we can generate different CIDs of the file by using different parameters.

E.g.,

  • By different CID versions (e.g., CIDv0 vs CIDv1)
  • By different Multibase (e.g., base58btc vs base32)
  • By different Multicodec (e.g., dag-pb vs raw)
  • By different Multihash algorithms (e.g., sha2-256 vs sha3-256)

If I get two different CIDs, how do I verify whether they are both derived from the same file?

Any suggestion is welcome. Thanks!

There’s no way other than comparing the data itself. You can always generate a SHA-256 hash of each file and compare the two hashes, if you don’t want to stream both files in simultaneously.

Hi @wclayf, thanks for your input!

I agree that if we have both original files on hand, comparing their SHA-256 hashes is straightforward.

The scenario I’m dealing with is this:

  • I start with one original file and its CID (CID_a).
  • I will then receive another CID (CID_b) that differs from CID_a. CID_b is claimed to be derived from the same original file.

Currently, my idea is to check whether the Multihash in CID_a and the Multihash in CID_b are the same:

  • If yes, conclude that CID_b points to the original file
  • If no
    • If the Multihash params (hash func, block size, etc.) are different, generate a Multihash from the original file using the Multihash params in CID_b, and compare it with the Multihash in CID_b
    • If the Multihash params are the same, conclude that CID_b points to a different file

I’m trying to figure out whether this approach is feasible.
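The first step of the check described above (pulling the Multihash out of each CID and comparing) can be sketched in Python roughly as follows. This is a minimal sketch under assumptions, not a full CID parser: it only handles CIDv0 strings and CIDv1 strings in base32 (`b` prefix) or base58btc (`z` prefix), and the function names `cid_multihash` and `same_multihash` are made up for illustration. A match means both CIDs carry the same digest; a mismatch is inconclusive, since a different hash function or a different way of adding the file also changes the Multihash.

```python
import base64

B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def b58decode(text: str) -> bytes:
    """Decode a base58btc string (Bitcoin alphabet) into bytes."""
    num = 0
    for ch in text:
        num = num * 58 + B58_ALPHABET.index(ch)
    body = num.to_bytes((num.bit_length() + 7) // 8, "big")
    # Each leading '1' character stands for one leading zero byte.
    pad = len(text) - len(text.lstrip("1"))
    return b"\x00" * pad + body


def read_varint(data: bytes, pos: int):
    """Read an unsigned varint starting at pos; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def cid_multihash(cid: str) -> bytes:
    """Return the raw multihash bytes embedded in a CID string."""
    if len(cid) == 46 and cid.startswith("Qm"):
        # CIDv0 is just a base58btc-encoded sha2-256 multihash.
        return b58decode(cid)
    prefix, body = cid[0], cid[1:]
    if prefix == "b":  # multibase: base32, lowercase, no padding
        raw = base64.b32decode(body.upper() + "=" * (-len(body) % 8))
    elif prefix == "z":  # multibase: base58btc
        raw = b58decode(body)
    else:
        raise ValueError(f"multibase prefix {prefix!r} not handled in this sketch")
    version, pos = read_varint(raw, 0)
    if version != 1:
        raise ValueError(f"unexpected CID version {version}")
    _codec, pos = read_varint(raw, pos)  # multicodec, e.g. 0x70 dag-pb, 0x55 raw
    return raw[pos:]


def same_multihash(cid_a: str, cid_b: str) -> bool:
    """True when both CIDs embed byte-identical multihashes."""
    return cid_multihash(cid_a) == cid_multihash(cid_b)
```

The first varint of the returned bytes is the hash function code and the second is the digest length, so the same helper can also tell whether CID_a and CID_b were produced with different hash functions.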

Ok, I see what you mean. If there’s enough info in the multihash prefix (like ‘Qm’) to know all the parameters needed to generate that hash from the CID_a data you already have, then your idea is good. However, even “generating” a CID to check could be inefficient because, afaik, you’ll have to write to your store rather than just read from it.

…unless there’s a way to just “generate” a hash from scratch from some data without writing the data. Tons of people have asked for that hash-calculation function, with varying degrees of success, but I still like your idea/thinking.

In other words, just reading the data behind CID_b and checking it byte by byte may be the faster algorithm, because generating a CID for any file means you’ll have to read the whole file at least.

And this might be common enough to actually BE a core feature in IPFS itself too. A function to just compare two CIDs.

You can use cid.ipfs.io to “compare”. The main thing to discern whether two CIDs correspond to the same data is the multihash; everything else is metadata on top.

The ipfs cid subcommands also allow converting between bases, etc.

Now, if the multihash is of a different type (i.e. a different hash function), you will have no other option than to re-add the original file using the right multihash type.

Also, files can be added in different ways, with different chunk sizes etc., so it may well be that both hashes point to the same file but are different, and you can’t find a way to reproduce how the hashes were generated.

This is because the hashes do not in fact correspond to the file; they correspond to the DAG that represents the file in IPFS. The final option, then, is to download the content behind both hashes and compare the files directly.

The interesting and useful question here is whether there’s a guaranteed way to do this without ever writing data to disk. That is: can I detect whether two CIDs point to identical data without having to STORE, even temporarily, one or both files’ data in my IPFS cache?

For example, if I want to compare two URLs on the network (non-IPFS ones), I can just stream them both in byte by byte and compare them without ever consuming more than, say, 8192 bytes of memory, and without writing even a single byte to disk.
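The streaming comparison described above might look roughly like this in Python. It is only a sketch: `streams_equal` and `_read_full` are hypothetical names, and the inputs can be any two binary file-like objects with a `read()` method (for instance HTTP response bodies). It holds one buffer of `chunk_size` bytes per stream at a time and writes nothing to disk.

```python
from typing import BinaryIO


def _read_full(stream: BinaryIO, size: int) -> bytes:
    """Read exactly `size` bytes unless EOF comes first.

    Network streams may return short reads, so keep reading until the
    buffer is full or the stream is exhausted.
    """
    parts = []
    remaining = size
    while remaining:
        part = stream.read(remaining)
        if not part:
            break
        parts.append(part)
        remaining -= len(part)
    return b"".join(parts)


def streams_equal(a: BinaryIO, b: BinaryIO, chunk_size: int = 8192) -> bool:
    """Compare two binary streams chunk by chunk, stopping at the first difference."""
    while True:
        chunk_a = _read_full(a, chunk_size)
        chunk_b = _read_full(b, chunk_size)
        if chunk_a != chunk_b:
            return False  # differing bytes, or one stream ended early
        if not chunk_a:
            return True  # both streams ended at the same point
```

Filling each buffer completely before comparing matters here: without it, two identical streams that happen to deliver data in differently sized pieces would be reported as different.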

If you’re really lazy like me, you could just create a tmpfs, init an ipfs repo on it, and use it as you normally would; nothing should ever hit disk (assuming you don’t have too much memory pressure and write to swap).

You can add without writing to disk:

ipfs add --only-hash ...

-n, --only-hash bool - Only chunk and hash - do not write to disk.
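Combined with the earlier idea of regenerating a CID from the original file using CID_b's parameters, a thin Python wrapper around this command could look like the sketch below. It assumes a working `ipfs` binary on the PATH and relies on the `--quieter`, `--cid-version`, `--hash` and `--chunker` options listed in `ipfs add --help`; the helper names `only_hash_argv` and `recompute_cid` are hypothetical. As noted earlier in the thread, a mismatch is still inconclusive if the chunker or layout used for CID_b is unknown.

```python
import subprocess
from typing import List, Optional


def only_hash_argv(
    path: str,
    cid_version: Optional[int] = None,
    hash_name: Optional[str] = None,
    chunker: Optional[str] = None,
) -> List[str]:
    """Build an `ipfs add --only-hash` command line for the given parameters."""
    argv = ["ipfs", "add", "--only-hash", "--quieter"]  # --quieter: print only the final hash
    if cid_version is not None:
        argv.append(f"--cid-version={cid_version}")
    if hash_name is not None:
        argv.append(f"--hash={hash_name}")  # e.g. "sha2-256" or "sha3-256"
    if chunker is not None:
        argv.append(f"--chunker={chunker}")  # e.g. "size-262144"
    argv.append(path)
    return argv


def recompute_cid(path: str, **params) -> str:
    """Run the command and return the CID it prints (nothing is written to the repo)."""
    result = subprocess.run(
        only_hash_argv(path, **params),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()
```

For example, `recompute_cid("original.bin", cid_version=1, hash_name="sha3-256")` would return a CID string that can be compared against CID_b, where `original.bin` stands in for the original file.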

Thanks. I had always assumed that it might just be writing it to disk (or whatever datastore it was using) and did the equivalent of a gc just when it was done.