CID concept is broken

So here’s an unexpected twist of events!
As I usually do, I watched the IPFS Core Implementations Weekly Sync. In it, @adin was talking about the very concept of this issue (as initially posted, not the conclusion we came to).

You can find the part where he talks about it here.
You can find his quite elaborate thoughts (basically a design document) on the matter here. And a patch for the protocol change (the documentation of it, not the code) is here.

Nice work, @adin!!! :+1:


Thanks for that, @markg85. That’s a really interesting and very important topic in IPFS, I think. Appreciate you posting the links.

Thanks for calling this out! Is @adin’s proposal written up yet? I wonder how similar it is to what I wrote in Git on IPFS - Links and References - #24 by Ericson2314. Very glad to see something happen here!

It turned out there’s a technical challenge: if you map a SHA-256 to a CID, you end up having to trust that the peer who created the mapping is honest and not an attacker, because the cost of each false entry from a hostile peer is that everyone has to download the entire file before discovering the hash is wrong.
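To make that concrete, here is a toy sketch (the setup and names are invented for illustration, not taken from any proposal): with a flat SHA-256(file) → CID mapping, the only check available happens after the whole file has been reassembled, so a false mapping costs the downloader the entire transfer before it is detected.

```python
import hashlib

def download_via_flat_mapping(claimed_sha256_hex: str, blocks: list[bytes]) -> bytes:
    """Toy model: 'blocks' stands in for data fetched from untrusted peers,
    located via an unverified SHA-256(file) -> CID mapping."""
    data = b"".join(blocks)                      # everything must be downloaded first
    if hashlib.sha256(data).hexdigest() != claimed_sha256_hex:
        # Only now do we discover the mapping (or the data) was bogus;
        # the attacker's false entry cost us the full download.
        raise ValueError("mapping was wrong; entire download wasted")
    return data
```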

I’m no expert at “logic puzzles” myself, but I don’t see a solution to this other than some kind of vote counting, where lots of peers basically “vote” on what the correct mapping is in order to have a hacker-proof system, and adding that kind of complexity may end up just not being worth it.

Yes, the idea isn’t just a black-box mapping from SHA-whatever to an unrelated CID. It’s instead to exploit the streaming structure, relaying blocks in reverse so that each block can be verified with only the hash of the yet-to-be-sent block needing to be trusted. That’s not perfect, but it’s a lot better.

I couldn’t be sure, but it sounded like @adin had similar thoughts, both mentioning the streaming structure of the SHA hashes and pointing out that the tree structure of the BLAKE ones could allow things that are even better.
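To make the reverse-streaming idea concrete, here is a toy sketch. It uses a simple chained hash (state_i = SHA-256(state_{i-1} || block_i)) as a stand-in for the actual Merkle–Damgård compression function inside SHA-256; the real proposal would operate on SHA-256’s internal state, but the verification pattern is the same: the sender streams blocks from the tail backwards, and each block is checked against a value the receiver already trusts before the next one is accepted.

```python
import hashlib

def step(prev_state: bytes, block: bytes) -> bytes:
    # Stand-in for the Merkle-Damgard compression function.
    return hashlib.sha256(prev_state + block).digest()

def chain_hash(blocks: list[bytes]) -> tuple[bytes, list[bytes]]:
    """Producer side: returns the final (trusted) digest plus the
    intermediate states the sender reveals alongside each block."""
    state, states = b"\x00" * 32, []
    for block in blocks:
        states.append(state)          # state *before* absorbing this block
        state = step(state, block)
    return state, states

def verify_in_reverse(trusted_final: bytes, tail: list[tuple[bytes, bytes]]) -> list[bytes]:
    """Receiver side: 'tail' is [(state_before_block, block), ...] sent in
    reverse order; each pair is verified against an already-trusted value."""
    expected, out = trusted_final, []
    for state_before, block in tail:
        if step(state_before, block) != expected:
            raise ValueError("block failed incremental verification")
        out.append(block)
        expected = state_before       # now trusted; anchors the next (earlier) block
    out.reverse()
    return out

blocks = [b"block-1", b"block-2", b"block-3"]
final, states = chain_hash(blocks)
assert verify_in_reverse(final, list(zip(states, blocks))[::-1]) == blocks
```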

Thanks @markg85, this issue has been circulating in various forms for a long time. Some of the issues and ideas linked in the PR are from 2017.

Overall the difficulty is in whether there is a way to do this verifiably using the existing major hash functions (e.g. SHA-2-256). I’m hoping that the proposal is sound, but I want to get more eyes on it to verify. If it’s determined to be sound we can see what it’ll take to implement the changes and where it falls on our roadmap, but having a sound proposal (assuming it is sound) is a step forward.


@Ericson2314 yep, my proposal (which I assume you found, but it’s in @markg85’s post above) is based on that discussion on Git on IPFS (I linked to it in the proposal).

I’d like more eyes to verify that the proposed approach is sound, but as I’ve written in that PR it’s looking pretty good: there is a formal definition of the problem we were concerned about (i.e. freestart collisions), and it looks like SHA-2 is currently safe from it. I’d still like to flesh out more deeply what goes wrong if freestart collisions exist but full collisions are still infeasible. I suspect the attack is both not that bad and, without the existence of freestart preimages, expensive for the attacker to pull off. Understanding some of the details here would allow us to evaluate whether enabling this behavior for SHA-1 is a possibility or whether Git would have to wait for SHA-2 support.

One thing that IMO is a pretty important extension/addition to the prior versions of this proposal is taking the approach referenced in RFC|BB|L2 - Speed up Bitswap with GraphSync CID only query, which basically allows us to efficiently download a linear graph across multiple peers by being a bit more trusting and scaling our trust over time. Without this, downloading a large block (e.g. a 10GiB file referenced by the SHA-2 of the file) would take a ridiculously long time and be restricted to downloading data from a single peer.
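A rough sketch of the “scale trust over time” part (names and numbers are made up for illustration, not taken from the RFC): allow each peer only a small amount of unverified data in flight at first, grow that allowance as its blocks keep verifying, and drop back sharply on a failure.

```python
class PeerTrust:
    """Toy policy: how many unverified bytes we are willing to accept
    from a peer ahead of verification."""
    def __init__(self, initial: int = 256 * 1024, ceiling: int = 64 * 1024 * 1024):
        self.allowance = initial
        self.ceiling = ceiling

    def on_block_verified(self, size: int) -> None:
        # Peer keeps sending good data: let it stream further ahead of verification.
        self.allowance = min(self.allowance * 2, self.ceiling)

    def on_block_failed(self) -> None:
        # Bad (or unlucky) peer: shrink back to a cautious window.
        self.allowance = 256 * 1024

    def can_request(self, unverified_in_flight: int, block_size: int) -> bool:
        return unverified_in_flight + block_size <= self.allowance
```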

Yep, if you check out the BLAKE paper, they basically mention our use case.


Using the DHT to store unverified data

The approaches mentioned here are understandable as well given that they unleash a lot of flexibility (e.g. I claim A is a gzipped version of B), but as many of you have noticed there are some difficulties associated with needing to download a lot of data over an untrusted p2p network before you can tell whether you have received the data you expected. In some use cases this might not be a big problem, but in the general case things can become quite difficult.

It’s also worth noting that these unverified or web-of-trust style solutions can unlock other functionality too. For example, I claim that A and B are the same book but translated into different languages. This is even less verifiable than the gzip example above, but it has its use cases.
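For the record-on-top-of-the-DHT flavour, one could imagine something like the hypothetical claim record below (field names are invented here, and HMAC is only a stand-in for a real signature scheme). The point is that the relation itself (“B is a gzip/translation of A”) is an assertion by the signer, not something the receiver can check from the bytes alone.

```python
import hashlib, hmac, json, time

def make_claim(signer_key: bytes, subject_cid: str, object_cid: str, relation: str) -> dict:
    """Hypothetical unverified-claim record, e.g. relation='gzip-of' or 'translation-of'."""
    body = {
        "subject": subject_cid,
        "object": object_cid,
        "relation": relation,        # the receiver must trust the signer on this
        "issued_at": int(time.time()),
    }
    payload = json.dumps(body, sort_keys=True).encode()
    body["sig"] = hmac.new(signer_key, payload, hashlib.sha256).hexdigest()
    return body
```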

I wouldn’t want to discourage folks from experimenting with implementing some of these ideas, but IMO if we can solve some of the more pressing problems in a verifiable way, that approach will be easier and will keep the mental models people have when working with IPFS intact.

That being said, if you’re interested in ways you can build new record types on top of the existing DHT without proposing a system-wide protocol upgrade, my comment RFC|BB|L2-09/10 Handling arbitrarily large blocks and Treating files as large blocks by aschmahmann · Pull Request #29 · protocol/beyond-bitswap · GitHub might be helpful (and if you’d like to go into this more, I’d recommend creating a new discuss topic and tagging me :smiley: )


Oh, I somehow missed the hackmd and PR being linked above! Since this thread has gone through a bunch of topics, I’ll take the conversation there.

@ngortheone First of all, you’re correct that a CID is not simply a hash of the original file’s contents, and that this can be confusing and sometimes inconvenient, particularly if multiple people independently pin the same file but with different hashing/chunking algorithms. This can indeed be wasteful, but I don’t see it being a major problem, since I would imagine in most cases people would pin an existing version of the file that is on IPFS (thereby using the same hash/chunking algorithm), and the case where different people instead add it from the original file would be less common. Worst case scenario, you have 2-3 versions of the same file available.
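To see why independent adds of the same file can diverge, here is a toy illustration (a flat list of chunk hashes standing in for a real UnixFS DAG): the same bytes chunked at different sizes, or hashed with a different function, produce different roots, and therefore different CIDs.

```python
import hashlib

def toy_root(data: bytes, chunk_size: int, algo: str = "sha256") -> str:
    """Hash each chunk, then hash the concatenated chunk hashes.
    A crude stand-in for building a UnixFS DAG and taking its root."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    leaf_hashes = [hashlib.new(algo, c).digest() for c in chunks]
    return hashlib.new(algo, b"".join(leaf_hashes)).hexdigest()

data = b"the same file contents" * 1000
print(toy_root(data, 256 * 1024))              # one chunk size
print(toy_root(data, 1024))                    # smaller chunks -> different root
print(toy_root(data, 256 * 1024, "blake2b"))   # different hash -> different root
```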

While not a perfect analogy, you could also complain that saving an image in different file formats is going to result in different files. That’s just how data storage works, and different image formats each have their own merits, as do different hash algorithms and chunking strategies.

There are aspects of the CID design I consider overkill, in particular multibase, i.e. the ability to express a given binary CID (array of bytes) as text in different ways. I think they should have just picked one encoding and gone with it - a lowest common denominator that’s going to work everywhere, like base16 or base32. Multibase can also result in confusion because the same CID can be represented in different ways, such that it’s not obvious that multiple text encodings are actually the same CID.
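The encoding point is easy to see with the standard library (the bytes below are just a placeholder, not a real CID): the same byte string renders as completely different-looking text in base16 versus base32.

```python
import base64

cid_bytes = bytes(range(36))                    # placeholder for a 36-byte binary CID
print(cid_bytes.hex())                          # base16 rendering
print(base64.b32encode(cid_bytes).decode())     # base32 rendering of the *same* bytes
```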

Multihash, on the other hand, is more useful. Many protocols/tools over the years have been designed to use a specific hash function, e.g. MD5 for file checksums or SHA-1 for BitTorrent and Git. Those hash algorithms have since been found to be insecure. Git is in the process of transitioning away from SHA-1, but the transition is way, way, way more complicated than it would have been if Git had been designed from the beginning to support multiple hash algorithms. IPFS is designed to cater for extensibility and evolution of the protocol, and while this does make things a bit more complex to deal with right now (e.g. choosing which hash algorithm to use), in the long run it’s likely to pay off.
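For contrast, a multihash is just the digest prefixed with a code identifying the hash function and the digest length, which is what makes swapping algorithms later tractable. A minimal sketch (0x12 being the registered code for sha2-256 is my understanding of the multicodec table; treat the exact constants as an assumption):

```python
import hashlib

def multihash_sha2_256(data: bytes) -> bytes:
    digest = hashlib.sha256(data).digest()
    # <hash-function-code><digest-length><digest>; 0x12 = sha2-256, 0x20 = 32 bytes
    return bytes([0x12, len(digest)]) + digest

mh = multihash_sha2_256(b"hello")
print(mh.hex())   # starts with "1220", so any reader knows how to interpret the rest
```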

Finally, the DAG serves an important purpose. The hash of each block/node is computed independently, and when downloading blocks a recipient can verify whether each block has the correct content based on its hash, without having to wait for the whole file to download. And if there is an error in a block, it can try to re-download just that block, either from the same peer or another. Suppose you had an 8 GB file; if hashing were only done at the file level, you’d have to grab the whole 8 GB again because you wouldn’t know which block(s) were corrupt.
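A toy sketch of that property (the `fetch_block` callable is hypothetical, and the block hashes are assumed to come from an already-verified parent node): each block is checked as it arrives, and only the blocks that fail need to be fetched again.

```python
import hashlib

def fetch_and_verify(block_hashes: list[bytes], fetch_block, max_retries: int = 3) -> list[bytes]:
    """block_hashes come from an already-verified parent node in the DAG;
    fetch_block(h) asks some peer for the block whose hash is h."""
    blocks = [None] * len(block_hashes)
    for i, expected in enumerate(block_hashes):
        for _ in range(max_retries):
            candidate = fetch_block(expected)
            if hashlib.sha256(candidate).digest() == expected:
                blocks[i] = candidate
                break                  # this block is good; never re-downloaded
            # corrupt or malicious response: retry just this block (perhaps from another peer)
        else:
            raise IOError(f"could not verify block {i}")
    return blocks
```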

Note that most filesystems already use a DAG (or rather a tree) - see for example the single/double/triple indirection nodes in a variety of Unix filesystems. A “file” is an abstraction over the block store which uses a DAG to store its layout, and this is true whether you’re using BSD Unix from the 1980s or IPFS in 2021.

I think the design rationale that led to some of the above properties of IPFS isn’t really well explained and can definitely be confusing to newcomers. I spent a lot of time thinking “what the hell is this mess” when I started out trying to understand how IPFS works, but over time I came to better understand some of the decisions (not that I agree with them all). I think improved documentation would be really valuable in this regard.


I wonder if Probabilistically Checkable Proofs (PCPs) could be used to publish verifiable mappings of SHA256(file) => CID(file) https://www.pepper-project.org