CID concept is broken

Hello, this thread has a lot of correct information mixed with a lot of information that is not fully correct.

First there is the concept of CID. A CID is just a way to represent a hash, giving the user some extra information:

  • What type of hash it is (sha256 etc). This is the multihash part.
  • What type of IPLD merkle-dag it is referencing (which can be a “raw” type to reference content directly). This is the multicodec part.
  • How the hash is encoded for human representation (base32, base64 etc). Some CIDs are equivalent if the only thing that changes is the encoding (see how IPFS supports both Qmxxx (base58) and bafyxxx (base32) and switches interchangeably between them). This is the multibase part.
  • Qmxxx CIDs are called “V0” by the way. They are actually just multihashes without any base or type information, which is assumed to be base58/protobuf for all practical purposes.

The whole CID concept works independently from IPFS. A CID can be used to represent a normal sha256 hash in the format you are used to seeing it (hex) if you want. https://cid.ipfs.io can help with conversions, as can the ipfs cid subcommands.
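For instance, a sketch with the ipfs cid subcommands, re-encoding the base58 CIDv0 that appears later in this thread (output omitted):

# re-encode a base58 Qm… CID as the equivalent base32 bafy… CIDv1
ipfs cid base32 QmShzonja4XdJsWTwDtfA6so1LgEqxjBkffZg3LgcNEvf1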

IPFS uses CIDs because they are future proof and allow working with any type of hash/format/encoding configuration, regardless of the default hashing, dag type, or encoding.

We could imagine IPFS using CIDs that just encode the “regular” sha256 sum of a file. However, as mentioned, IPFS is not content-addressing files themselves, but rather IPLD merkle-dags. It is not that the same content can be represented by different CIDs, but rather that different DAGs are represented by different CIDs.

One of the main reasons that IPFS chunks and DAG-ifies large files is that you want to verify content as you move it around the network. If you did not chunk a 1GB file, you would need to download the whole thing before verifying that it corresponds to what was requested. This would let misbehaving peers consume too many resources from others. For this reason, IPFS nodes refuse to move blocks larger than 1 or 2 MB on the public network. Of course, private IPFS networks can be adapted to whatever you need, and you could make IPFS not chunk at all.
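As a quick illustration (a sketch assuming go-ipfs; big.iso is a hypothetical file, and --only-hash computes the CID without storing anything), changing the chunker changes the DAG, and therefore the root CID, even for identical bytes:

# default: fixed-size 256 KiB chunks
ipfs add --only-hash --chunker=size-262144 big.iso
# 1 MiB chunks: same bytes, different DAG, different root CID
ipfs add --only-hash --chunker=size-1048576 big.iso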

Also, with smaller chunks, a large 1GB file which is similar to another 1GB file can be deduplicated. If they were each made of a single chunk, they would not be able to share pieces.

There are other smaller reasons, like the ability to better distribute downloads and request different chunks from different people, and the ability to support DHT lookups of bytes located in the middle of the content (i.e. seeking video without downloading from the start or relying on a provider that has the whole thing), all while ensuring the content can be easily verified.
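Seeking is even exposed in the CLI; a sketch (the CID and numbers are hypothetical; only the blocks covering the requested range need to be fetched):

# read 1 MiB starting 2 GiB into the file, without downloading the start
ipfs cat --offset=2147483648 --length=1048576 <cid-of-big-file> > slice.bin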

With all the above, a default which does not do any chunking seems less reasonable than any other default. Selecting the chunking/dag algorithm per request would be disastrous for performance and security reasons.

The question of “how the dag is stored by the OS” is not very relevant, as that is a lower-layer issue and can be solved there regardless. The OS/storage devices are as well or as badly suited to store a DAG as they are to store different types of non-chunked content. Different datastore backends will optimize for different things as well (e.g. badger vs. flatfs).

Then, the question of “I have a sha256 digest and I want to find that on IPFS” can only be solved with a “search engine” (be it the DHT, or something else). But I find this similar to saying “I have a filename and I want to find that on the web” and complaining that the HTTP protocol does not give you that. Just like you browse the web with full URLs (and use a search engine otherwise), you will normally browse IPFS using CIDs that get you to the content you want, and normally you will be referencing DAGs directly.

In the end the question is not how to translate between sha256 digest to CID, but how to translate between “thing that a human can read/remember” and CID. The only reason sha256 digests are provided next to human-readable filenames now is to be able to verify the content after download. However, IPFS embeds this functionality directly, which makes additional digests rather redundant.

So, taking into account the above, the choice of a 256KB block size with a balanced DAG layout as the default wrapper for content in the public IPFS network was deemed the safest when balancing a bunch of practical, security and performance concerns. Of course, optimizing just for deduplication, or just for discovery, results in other choices, and the good thing is that IPFS is architecturally designed to deal with different choices, even if the public network sets some limits.

6 Likes

But how?
How can you get that sha256 without downloading the file?
Is there some dag-tree-traversal magic possible to get that original hash out of it?

Someone has to tell you, and then you need to trust the source of that info, and then you need to verify it is correct once you have the full file. And by that time, you don’t need it anymore, because if you got the full file from IPFS it is already verified. This is why I don’t see too many upsides in worrying about full sha256 digests when operating in IPFS-land.

As long as you’re only in IPFS-land it’s a non-issue, but it seems to come up if, say, you want to download an iso and they publish a sha256 but sadly not a CID. You think, “Hey, I bet someone else has already put this on IPFS, and if they did I’d like to download it from there”.

I’m not sure how you’d do that. I guess you could have some sort of search engine that published file hashes (which I think we’re referring to as content hashes, but I find that term to be ambiguous), but they’d have to download the entire file to do the hash, so it would be nice if there was a way for the original publisher to include the hash when they publish it. It would also be a nice way for clients that use gateways to verify the file, since they can’t do it without the Merkle DAG and must trust the gateway.

I also wanted to add that the post’s title is probably more confrontational and antagonistic than intended.

EDIT: Even if a publisher somehow included the hash of an entire file, you’d still have to trust that they were telling the truth, so there is the possibility of publishing a number of files with an incorrect hash, forcing people to download them to verify.

1 Like

Sure, fair point.

Please do enlighten me a little further with regard to what exactly is hashed.
As, right now, I’m still living under the assumption that somewhere down the IPFS chain the file in its entirety is hashed using sha256. As if you’d be calling sha256sum somefile on Linux.

Or, and this is entirely possible too, is the file as a whole never hashed and is it only hashed in chunks? That would be in those 256KB blocks.

If it’s the former, the IPFS network somehow, somewhere, must know the sha256 hash of the file in its entirety.
If it’s the latter, then I’ve learned something new yet again :slight_smile:

As, right now, I’m still living under the assumption that somewhere down the IPFS chain the file in its entirety is hashed using sha256. As if you’d be calling sha256sum somefile on Linux.

That is, I believe, mostly incorrect. It would be somewhat correct for a raw block, where, if the CID used sha256, you could extract that hash and it would match the sha256 of the file that contains that block.

I agree with you @zacharywhitley. Everything about the IPFS docs and the “content is identified by its hash” claim turns out to be false based on this new info, which I’m just learning myself also, because of this post/thread.

I had always assumed I could convert a SHA256 directly to a CID, and it’s quite disappointing that this can’t be done.

The original poster @ngortheone is right, CID is broken, because it’s not a real CID. True “content addressing” would mean I can take a hash of any data, and then simply try to access the data from IPFS, without needing to first HAVE the entire data (or know someone who does, to get a CID). With true “content addressing” there would be no need for a search engine to solve this (as zachary points out), because it would just be a normal DHT hash lookup.

Luckily IPFS “can” quite easily fix this, if they want. They simply need to ensure that their DHT can map every SHA-256 to a Qm…, as a hashing function. It belongs inside IPFS. It doesn’t belong in search engines. Maybe there’s a way to accomplish this by calling the DHT directly, and it doesn’t require a code change? I don’t know enough about the DHT to be able to answer that.

1 Like

The content of whatever the CID retrieves is what is hashed. If the CID gets you a protobuf of some kind, that is what is hashed. If the CID gets you a piece of a DAG (which I think is actually a protobuf), then that is what is hashed. If the CID gets you a piece of a file, then that is what is hashed. If the CID gets you back an arbitrary chunk of bytes stored directly by an application, then that is what is hashed.

In other words, if you hash what is returned by a CID, you can KNOW that you got the right bytes if the hashes match.
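You can check this from the CLI; a sketch using the raw “hi” CID constructed at the end of this thread (ipfs block get returns the exact block bytes the CID names):

# hash the raw block yourself; it matches the digest embedded in the CID
ipfs block get bafkreiey5jxe6ilpf62lnh77tm5ejbbmhbugzjuf6p2v3remlu73ced34q | sha256sum
# -> 98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4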

I believe you’ll find that is correct.

IMHO, knowing the hash of the entire file is completely useless unless you need to verify the assembled content after downloading the whole thing. But by the time you’ve done that with IPFS, you can be assured that the individual pieces were verified, and therefore the whole should be valid as well.

Oh, and the misleading/confusing thing is that gateways will take a CID and fetch multiple other CIDs behind the scenes before delivering you the actual content. So when that happens, you really don’t see the fact that the CID you originally provided actually retrieved a protobuf (I think) of some kind that contained either the actual data (for really small files, again, I think) or a list of CIDs for the various chunks that actually make up the file. The entire thing is assembled by the gateway and returned, even though the original CID actually only specified the starting point for that retrieval.
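That hidden structure is easy to see locally; a sketch using the FreeBSD iso CID that appears later in this thread (ipfs refs prints the CIDs of the direct children of a node):

# list the chunk CIDs the root node links to
ipfs refs QmShzonja4XdJsWTwDtfA6so1LgEqxjBkffZg3LgcNEvf1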

So, a CID hashes its own content, and nothing but that content. It’s just that you might not be getting only that content depending on what you’re using to retrieve the CID.

IPFS is composed of layers upon layers upon layers, but fundamentally the statement is true, at the lowest level. A CID hashes whatever that single CID contains. It’s the higher layers that confuse the issue by returning you MORE than what the actual CID specifies. And when storing a file, again a higher level, the returned CID isn’t actually the file’s hash (as you’ve learned), but the hash of the data structures used to represent and store the chunk(s) and metadata of the file, each of which actually is composed of one or more other (hidden) CIDs.

And the IPFS team, I’m certain, would welcome a pull request that implements your request!

1 Like

I took a quick look at the Anatomy of a CID from the creators of IPFS itself.

That too makes you believe that the CID is a digest of your file.
It makes you believe it’s a sort of 1-to-1 mapping, while in reality it’s potentially only such a mapping when your file fits within the block size. As soon as chunking kicks in, it’s just not a mapping anymore.
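The single-block case is easy to demonstrate; a sketch assuming go-ipfs, where --only-hash computes the CID without actually adding anything and --cid-version=1 implies raw leaves (the values are taken from the conversion example at the end of this thread):

$ echo hi > hi.txt
$ sha256sum hi.txt
98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4  hi.txt
$ ipfs add --only-hash --cid-version=1 --raw-leaves hi.txt
added bafkreiey5jxe6ilpf62lnh77tm5ejbbmhbugzjuf6p2v3remlu73ced34q hi.txt

The CID’s digest is exactly the file’s sha256 here because the whole file fits in one raw block; once chunking kicks in, the root CID hashes a DAG node instead.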

1 Like

A perfect use case was given above: If someone knows the SHA256 of a Linux ISO and they would like to pull it from IPFS if available. That’s how “content addressing” should work, and how it’s “represented” in most IPFS docs. Simply by knowing the hash of some data I should be able to request it. That’s the simplest possible use case of “content addressing” and should work.

The first problem that pops into my mind (I’m sure there are more) is validation. A malicious actor that doesn’t like you, knowing that you’re looking for that iso, could publish a file with that sha256 but incorrect data. You’d end up downloading the entire file just to find out it’s corrupt. As previously mentioned, if you did handle these as single large files, there’s the problem of moving the files around the network, and you would no longer be able to deduplicate at anything smaller than the file level.

Thanks everyone for responses!

Hello, this thread has a lot of correct information mixed with a lot of information that is not fully correct.

yes, this is me being not fully up to speed with nomenclature, sorry.

The question of “how the dag is stored by the OS” is not very relevant

Yes, I am experimenting with a storage layer that stores the file intact, while all the metadata about the file (the DAG) is stored in an SQLite database. The lowest nodes of the DAG will not have the data, only offsets to 256k chunks in the file. For that I need a way to connect a CID to a file via the file’s hash sum (sha/blake/…), because the files are stored by their hash sum (e.g. ~/.ipfs/storage/sha256-xxxxyyyyzzzz).

This has a number of benefits:

  • a file can be used by other applications without ipfs cat
  • performance is better, because we don’t need file_size/chunk_size extra system calls to read it
  • it allows storing a single immutable copy of the file on disk when I want to both use the file and share it via IPFS - less disk is used

A drawback:

  • the block de-duplication property is somewhat lost, but I trust my ZFS pool to perform deduplication

After looking a bit deeper I got inspired by this:

Using go-ipfs, if I do this I can get the data section of the root node
(the CID refers to a FreeBSD iso file I added to IPFS a few posts earlier):

$ ipfs dag get QmShzonja4XdJsWTwDtfA6so1LgEqxjBkffZg3LgcNEvf1 | jq -r .data  | base64 -d | protoc --decode_raw
1: 2
3: 4466092032
4: 45613056
4: 45613056
4: 45613056

Here we see the raw protobuf encoding of the data section, which can be easily decoded by looking at https://github.com/ipfs/go-unixfs/blob/master/pb/unixfs.proto

The root node already has some metadata: 1: 2 - field 1 is the DataType, and 2 means this is a file.
3: 4466092032 - field 3 is the filesize, 4466092032 bytes, which is correct. The repeated 4: entries are the blocksizes of the child chunks.
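For reference, the relevant part of that unixfs.proto (abridged from the file linked above):

message Data {
  enum DataType {
    Raw = 0;
    Directory = 1;
    File = 2;
    Metadata = 3;
    Symlink = 4;
    HAMTShard = 5;
  }

  required DataType Type = 1;     // field 1: 2 = File
  optional bytes Data = 2;
  optional uint64 filesize = 3;   // field 3: total file size
  repeated uint64 blocksizes = 4; // field 4: size of each child chunk
}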

The protobuf schema already has a Metadata message where a MIME type can be stored, so I think the easiest way would be to extend the Metadata message to include a multihash of the file itself… something along the lines of:

message Metadata {
	optional string MimeType = 1;
+   repeated string Hash = 2;
}

That sounds a bit like what I imagine ipfs add --no-copy is doing.

Any server can send corrupted data that doesn’t match the hash they claim it matches, so as far as risk from malicious peers goes, a lookup via SHA256 isn’t more risky than a lookup by CID, because the only way to verify is to hash the data yourself after you get it.

So verification and retrieval are two separate things. “Trustless” systems don’t guarantee you can never read malicious data, they only guarantee that you can verify it yourself.

1 Like

IPFS limits the block size so at least it’s bounded.

1 Like

I think it’s possible to also turn off the chunker completely and make IPFS hash an entire file into a single SHA256 hash, but of course that still doesn’t (??) imply there’s a way to look up such a file if you start by knowing ONLY its hash and nothing else.

I don’t think you can do a lookup. Even in that happy day scenario.

What you might be able to do is construct a CID from that sha256 hash. You might just have all the information then to construct one. I’m not at all sure though.

1 Like

On converting a sha256 digest to a CID manually:

  1. Obtain the digest (hex string)
  2. 01551220+digest (01 = CIDv1, 55 = the raw codec, 12 = sha2-256, 20 = a 32-byte digest)
  3. Convert the resulting bytes to base32
  4. Prepend B (or b when using all lowercase) as the multibase prefix

Example:

1 - echo hi | sha256sum -> 98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4
2 - 0155122098ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4
3 - becomes AFKREIEY5JXE6ILPF62LNH77TM5EJBBMHBUGZJUF6P2V3REMLU73CED34Q in base32
4 - cid is BAFKREIEY5JXE6ILPF62LNH77TM5EJBBMHBUGZJUF6P2V3REMLU73CED34Q (lowercase or uppercase is the same)

Result can be verified at CID Inspector | IPFS

Programmatically it is, of course, also possible using the cid/multihash libraries.
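A minimal sketch in Go, assuming the go-cid and go-multihash libraries (it reproduces the manual steps above for the same digest):

package main

import (
	"encoding/hex"
	"fmt"

	cid "github.com/ipfs/go-cid"
	mh "github.com/multiformats/go-multihash"
)

func main() {
	// sha256 digest of "hi\n" from the example above
	digest, err := hex.DecodeString("98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4")
	if err != nil {
		panic(err)
	}

	// wrap the bare digest as a sha2-256 multihash (prepends the 12 20 bytes)
	mhash, err := mh.Encode(digest, mh.SHA2_256)
	if err != nil {
		panic(err)
	}

	// combine with the CIDv1 version byte and the raw codec (0x55)
	c := cid.NewCidV1(cid.Raw, mhash)

	// a CIDv1 stringifies as lowercase base32 by default
	fmt.Println(c) // bafkreiey5jxe6ilpf62lnh77tm5ejbbmhbugzjuf6p2v3remlu73ced34q
}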

ipfs get BAFKREIEY5JXE6ILPF62LNH77TM5EJBBMHBUGZJUF6P2V3REMLU73CED34Q should also work.

2 Likes