CID concept is broken

Hello, this thread mixes a lot of correct information with some information that is not fully correct.

First there is the concept of CID. A CID is just a way to represent a hash, giving the user some extra information:

  • What type of hash it is (sha256 etc). This is the multihash part.
  • What type of IPLD Merkle DAG it is referencing (which can be a “raw” type to reference content directly). This is the multicodec part.
  • How the hash is encoded for human representation (base32, base64 etc). Some CIDs are equivalent if the only thing that changes is the encoding (see how IPFS supports both Qmxxx (base58) and bafyxxx (base32) and switches interchangeably between them). This is the multibase part.
  • Qmxxx CIDs are called “V0” by the way. They are actually just multihashes without any base or type information, which are assumed to be base58/protobuf for all practical purposes.
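The multihash part above is just a few bytes of self-description prepended to a regular digest. A minimal sketch (the 0x12 code for sha2-256 and the length prefix are from the multihash spec; the function name is mine):

```python
import hashlib

def sha256_multihash(data: bytes) -> bytes:
    # A multihash is: <hash-function-code><digest-length><digest>.
    # 0x12 is the registered code for sha2-256; the digest is 0x20 (32) bytes.
    digest = hashlib.sha256(data).digest()
    return bytes([0x12, len(digest)]) + digest

mh = sha256_multihash(b"hello world")
print(mh.hex())  # "1220" followed by the plain sha256 hex digest
```

So a multihash carries exactly the sha256 digest you already know, plus two bytes saying “this is sha2-256, 32 bytes long”.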

The whole CID concept works independently from IPFS. A CID can be used to represent a normal sha256 hash in the format you are used to seeing it in (hex) if you want. https://cid.ipfs.io can help with conversions, as can the ipfs cid subcommands.
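Putting the three parts together, building a CIDv1 by hand is only a few lines. A sketch, assuming a single “raw” block (codec 0x55) hashed with sha2-256; the single-byte “varints” only work here because all these code values are below 0x80:

```python
import base64
import hashlib

def cidv1_raw_base32(data: bytes) -> str:
    # CIDv1 binary layout: <version><content-codec><multihash>, all varints.
    # For small values a varint is a single byte: 0x01 = version 1,
    # 0x55 = "raw" (the block is the content itself, no DAG wrapping).
    multihash = bytes([0x12, 0x20]) + hashlib.sha256(data).digest()
    cid_bytes = bytes([0x01, 0x55]) + multihash
    # Multibase: base32, lowercase, unpadded, with a leading "b" prefix.
    return "b" + base64.b32encode(cid_bytes).decode().lower().rstrip("=")

print(cidv1_raw_base32(b"hello"))  # a "bafkrei..."-style CID
```

Changing only the last step (the multibase) yields a different-looking string for the same CID, which is exactly why base58 Qmxxx and base32 bafyxxx forms are interchangeable.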

IPFS uses CIDs because they are future-proof and allow working with any type of hash/format/encoding configuration, regardless of the defaults for hashing, DAG type and encoding.

We could imagine IPFS using CIDs that just encode the “regular” sha256 sum of a file. However, as mentioned, IPFS is not content-addressing files themselves, but rather IPLD Merkle DAGs. It is not that the same content can be represented by different CIDs, but rather that different DAGs are represented by different CIDs.

One of the main reasons that IPFS chunks and DAG-ifies large files is that you want to verify content as it moves around the network. If you did not chunk a 1GB file, you would need to download it fully before verifying that it corresponds to what was requested. This would enable misbehaving peers to consume too many resources from others. For this reason, IPFS nodes refuse to move blocks larger than 1 or 2 MB on the public network. Of course, private IPFS networks can be configured however you like, and you could make IPFS not chunk at all.
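A toy sketch of why chunking enables incremental verification (illustrative only, not the actual UnixFS chunker; function names are mine):

```python
import hashlib

CHUNK_SIZE = 256 * 1024  # the default IPFS chunk size

def chunk_hashes(data: bytes, chunk_size: int = CHUNK_SIZE):
    # Split into fixed-size chunks and hash each one; a receiver that
    # knows these hashes can verify every piece as it arrives.
    return [hashlib.sha256(data[i:i + chunk_size]).hexdigest()
            for i in range(0, len(data), chunk_size)]

def verify_stream(expected, incoming_chunks):
    # Reject a bad chunk as soon as it arrives: at most one chunk of
    # bandwidth is wasted, instead of a whole 1GB download.
    for want, chunk in zip(expected, incoming_chunks):
        if hashlib.sha256(chunk).hexdigest() != want:
            return False
    return True
```

With a single 1GB block, the first (and only) verification can happen after the last byte; with 256KB chunks, a misbehaving peer is caught 256KB in.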

Also, with smaller chunks, a large 1GB file which is similar to another 1GB file can be deduplicated. If each were made of a single chunk, they would not be able to share pieces.
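The dedup effect is easy to see with a toy example (a hypothetical 4-byte chunk size just for illustration; IPFS defaults to 256KB):

```python
import hashlib

def chunk_set(data: bytes, chunk_size: int = 4):
    # Set of chunk digests; identical chunks collapse to one entry.
    return {hashlib.sha256(data[i:i + chunk_size]).digest()
            for i in range(0, len(data), chunk_size)}

file_a = b"aaaabbbbccccdddd"
file_b = b"aaaabbbbXXXXdddd"  # same as file_a except one chunk

shared = chunk_set(file_a) & chunk_set(file_b)
print(len(shared))  # 3 of the 4 chunks are identical and stored once
```

Unchunked, the two files hash to completely different digests and share nothing, even though they differ by only a few bytes.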

There are other, smaller reasons, like the ability to better distribute downloads and request different chunks from different people, or the ability to support DHT lookups of bytes located in the middle of the content (i.e. seeking video without downloading from the start or relying on a provider that has the whole thing), all while ensuring the content can be easily verified.

With all the above, a default that does not do any chunking seems less reasonable than any other default. Selecting the chunking/DAG algorithm per request would be disastrous for performance and security reasons.

The question of “how the DAG is stored by the OS” is not very relevant, as that is a lower-layer issue and can be solved there regardless. The OS/storage devices are as well or badly suited to store a DAG as they are to store different types of non-chunked content. Different datastore backends will optimize for different things as well (i.e. badger vs. fs).

Then, the question of “I have a sha256 digest and I want to find that on IPFS” can only be solved with a “search engine” (be it the DHT or something else). But I find this similar to saying “I have a filename and I want to find that on the web” and complaining that the HTTP protocol does not give you that. Just like you browse the web with full URLs (and use a search engine otherwise), you will normally browse IPFS using CIDs that get you to the content you want, and normally you will be referencing DAGs directly.

In the end, the question is not how to translate between a sha256 digest and a CID, but how to translate between “a thing that a human can read/remember” and a CID. The only reason sha256 digests are provided next to human-readable filenames today is to be able to verify the content after download. IPFS embeds this functionality directly, which makes additional digests rather redundant.

So, taking into account the above, the choice of a 256KB block size with a balanced DAG layout as the default wrapper for content in the public IPFS network was deemed the safest choice when balancing a bunch of practical, security and performance concerns. Of course, optimizing just for deduplication, or just for discovery, results in other choices, and the good thing is that IPFS is architecturally designed to deal with different choices, even if the public network sets some limits.
