CID concept is broken

Right now I’m still living under the assumption that somewhere down the IPFS chain the file in its entirety is hashed using SHA-256, as if you’d run sha256sum somefile on Linux.

That is, I believe, mostly incorrect. It would be somewhat correct for a raw block: if the CID used sha256, you could extract that hash and it would match the sha256 of the file that contains that block.

I agree with you @zacharywhitley. Everything the IPFS docs claim about “content is identified by its hash” turns out to be misleading based on this new info, which I’m also just learning myself because of this post/thread.

I had always assumed I could convert a SHA-256 hash directly to a CID, and it’s quite disappointing that this can’t be done.

The original poster @ngortheone is right: the CID is broken, because it’s not a real content identifier. True “content addressing” would mean I can take a hash of any data and simply try to access that data from IPFS, without needing to first HAVE the entire data (or know someone who does, to get a CID). With true “content addressing” there would be no need for a search engine to solve this (as zachary points out), because it would just be a normal DHT hash lookup.

Luckily IPFS “can” quite easily fix this, if they want: they simply need to make their DHT map every SHA-256 digest to the corresponding Qm… CID. That mapping belongs inside IPFS; it doesn’t belong in search engines. Maybe there’s a way to accomplish this by calling the DHT directly, without a code change? I don’t know enough about the DHT to answer that.


The content of whatever the CID retrieves is what is hashed. If the CID gets you a protobuf of some kind, that is what is hashed. If the CID gets you a piece of a DAG (which I think is actually a protobuf), then that is what is hashed. If the CID gets you a piece of a file, then that is what is hashed. If the CID gets you back an arbitrary chunk of bytes stored directly by an application, then that is what is hashed.

In other words, if you hash what is returned by a CID, you can KNOW that you got the right bytes if the hashes match.
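That check can be sketched in a couple of lines of Python. This is a hedged illustration of the principle, not actual go-ipfs code, and the function name is made up:

```python
import hashlib

def block_matches_digest(block: bytes, expected_sha256_hex: str) -> bool:
    """Re-hash the exact bytes the CID returned and compare against the
    sha256 digest carried inside that CID's multihash."""
    return hashlib.sha256(block).hexdigest() == expected_sha256_hex

# The digest only vouches for these exact bytes, not for any larger
# file they may be one chunk of.
good = hashlib.sha256(b"some block").hexdigest()
print(block_matches_digest(b"some block", good))   # True
print(block_matches_digest(b"tampered!!", good))   # False
```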

I believe you’ll find that is correct.

IMHO, knowing the hash of the entire file is completely useless unless you need to verify the assembled content after downloading the whole thing. But by the time you’ve done that with IPFS, you can be assured that the individual pieces were verified, and therefore the whole should be valid as well.

Oh, and the misleading/confusing thing is that gateways will take a CID and fetch multiple other CIDs behind the scenes before delivering you the actual content. When that happens, you don’t see that the CID you originally provided actually retrieved a protobuf (I think) of some kind that contained either the actual data (for really small files, again, I think) or a list of CIDs for the chunks that actually make up the file. The entire thing is assembled by the gateway and returned, even though the original CID only specified the starting point for that retrieval.

So, a CID hashes its own content, and nothing but that content. It’s just that you might not be getting only that content depending on what you’re using to retrieve the CID.

IPFS is composed of layers upon layers upon layers, but fundamentally the statement is true, at the lowest level. A CID hashes whatever that single CID contains. It’s the higher layers that confuse the issue by returning you MORE than what the actual CID specifies. And when storing a file, again a higher level, the returned CID isn’t actually the file’s hash (as you’ve learned), but the hash of the data structures used to represent and store the chunk(s) and metadata of the file, each of which actually is composed of one or more other (hidden) CIDs.

And I’m certain the IPFS team would welcome a pull request that implements your request!


I took a quick look at the Anatomy of a CID from the creators of IPFS itself.

That too makes you believe that the CID is a digest of your file.
It makes you believe it’s a sort of 1-to-1 mapping, while in reality it’s only such a mapping when your file fits within the block size. As soon as chunking kicks in, it’s just not a mapping anymore.


A perfect use case was given above: If someone knows the SHA256 of a Linux ISO and they would like to pull it from IPFS if available. That’s how “content addressing” should work, and how it’s “represented” in most IPFS docs. Simply by knowing the hash of some data I should be able to request it. That’s the simplest possible use case of “content addressing” and should work.

The first problem that pops into my mind (I’m sure there are more) is validation. A malicious actor who doesn’t like you, knowing that you’re looking for that ISO, could publish a file with that sha256 but incorrect data. You’d end up downloading the entire file just to find out it’s corrupt. As previously mentioned, if you did handle these as single large files, there’s the problem of moving them around the network, and you would no longer be able to deduplicate at anything smaller than the file level.

Thanks everyone for responses!

Hello, this thread has a lot of information which is correct mixed with a lot of information that is not fully correct.

yes, this is me being not fully up to speed with nomenclature, sorry.

The question of “how the DAG is stored by the OS” is not very relevant.

Yes, I am experimenting with a storage layer that stores the file intact; all the metadata about the file (the DAG) is going to be stored in an sqlite database. The lowest nodes of the DAG will not hold the data, only offsets to 256k chunks in the file. For that I need a way to connect a CID to a file via the file’s hash sum (sha/blake/…), because the files are stored by their hash sum (e.g. ~/.ipfs/storage/sha256-xxxxyyyyzzzz).

This has a number of benefits:

  • a file can be used by other applications without ipfs cat;
  • performance is better, because we avoid file_size/chunk_size extra system calls to read it;
  • a single immutable copy of the file on disk can both be used directly and shared via IPFS, so less disk is used.

A drawback:

  • the block de-duplication property is somewhat lost. But I trust my ZFS pool to perform deduplication.
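A minimal sketch of that layout in Python, assuming a hypothetical blocks table and using a plain sha256 hex string as a stand-in for the leaf CID:

```python
import hashlib
import sqlite3

CHUNK = 256 * 1024  # 256k chunks, as above

def open_db(path: str = ":memory:") -> sqlite3.Connection:
    # Hypothetical schema: each row maps a leaf to a byte range in a file.
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS blocks "
               "(leaf TEXT, file TEXT, offset INT, length INT)")
    return db

def index_file(db: sqlite3.Connection, path: str) -> str:
    """Record only (leaf, file, offset, length) rows; the chunk bytes
    stay inside the original file, which is kept intact on disk."""
    data = open(path, "rb").read()
    file_hash = hashlib.sha256(data).hexdigest()
    for off in range(0, len(data), CHUNK):
        chunk = data[off:off + CHUNK]
        leaf = hashlib.sha256(chunk).hexdigest()  # stand-in for the leaf CID
        db.execute("INSERT INTO blocks VALUES (?, ?, ?, ?)",
                   (leaf, file_hash, off, len(chunk)))
    return file_hash

def read_block(db: sqlite3.Connection, path: str, leaf: str) -> bytes:
    """Serve one block with a single seek+read, instead of opening a
    separate per-chunk file."""
    off, length = db.execute(
        "SELECT offset, length FROM blocks WHERE leaf = ?", (leaf,)).fetchone()
    with open(path, "rb") as f:
        f.seek(off)
        return f.read(length)
```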

After looking a bit deeper I got inspired by this:

http://www.andrew.cmu.edu/user/xinyit/2019/04/11/Verify-IPFS-Multihash/

Using go-ipfs, if I do this I can get the data section of the root node
(the CID refers to a FreeBSD ISO file I added to IPFS a few posts earlier):

$ ipfs dag get QmShzonja4XdJsWTwDtfA6so1LgEqxjBkffZg3LgcNEvf1 | jq -r .data  | base64 -d | protoc --decode_raw
1: 2
3: 4466092032
4: 45613056
4: 45613056
4: 45613056

Here we see raw protobuf encoding of the data section, which can be easily decoded by looking at https://github.com/ipfs/go-unixfs/blob/master/pb/unixfs.proto

The root node already has some metadata: 1: 2 means this is a file, and
3: 4466092032 means it has a size of 4466092032 bytes, which is correct.
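Those numbered fields are plain protobuf varints on the wire. A small sketch (my own illustration, not go-unixfs code) shows how protoc --decode_raw arrives at pairs like 1: 2 and 3: 4466092032:

```python
def encode_varint(v: int) -> bytes:
    """Little-endian base-128 varint, as protobuf uses on the wire."""
    out = bytearray()
    while True:
        b = v & 0x7F
        v >>= 7
        out.append((b | 0x80) if v else b)
        if not v:
            return bytes(out)

def decode_varint(buf: bytes, i: int):
    """Return (value, next_index) for the varint starting at buf[i]."""
    value, shift = 0, 0
    while True:
        value |= (buf[i] & 0x7F) << shift
        shift += 7
        i += 1
        if not buf[i - 1] & 0x80:
            return value, i

def decode_fields(buf: bytes):
    """Yield (field_number, value) pairs; all the unixfs Data fields
    shown above use wire type 0 (varint)."""
    i = 0
    while i < len(buf):
        tag, i = decode_varint(buf, i)
        field, wire_type = tag >> 3, tag & 7
        assert wire_type == 0
        value, i = decode_varint(buf, i)
        yield field, value

# field 1 (Type) = 2 "File", field 3 (filesize) = 4466092032
sample = bytes([0x08]) + encode_varint(2) + bytes([0x18]) + encode_varint(4466092032)
print(list(decode_fields(sample)))  # [(1, 2), (3, 4466092032)]
```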

The protobuf file already has a Metadata section where a MIME type can be stored, so I think the easiest way would be to extend the Metadata section to include a multihash of the file itself… something along the lines of

message Metadata {
	optional string MimeType = 1;
+	repeated string Hash = 2;
}

That sounds a bit like what I imagine IPFS add --no-copy is doing.

Any server can send corrupted data that doesn’t match the hash it claims to match, so as far as risk from malicious actors goes, a lookup via SHA-256 isn’t any riskier than a lookup by CID: the only way to verify is to hash the data yourself after you get it.

So verification and retrieval are two separate things. “Trustless” systems don’t guarantee you’ll never receive malicious data; they only guarantee that you can verify it yourself.


IPFS limits the block size, so at least it’s bounded.


I think it’s also possible to turn off the chunker completely and make IPFS hash an entire file into a single SHA-256 hash, but of course that still doesn’t (??) imply there’s a way to look up such a file if you start by knowing ONLY its hash and nothing else.

I don’t think you can do a lookup, even in that happy-day scenario.

What you might be able to do is construct a CID from that sha256 hash. You might just have all the information needed to construct one. I’m not at all sure though.


On converting a sha256 digest to a CID manually:

  1. Obtain the digest (hex string)
  2. Prepend 01551220 (CIDv1, raw codec 0x55, sha2-256 code 0x12, 32-byte digest length 0x20)
  3. Convert the bytes to base32
  4. Prepend B (or b when using all lowercase) as the multibase prefix

Example:

1 - echo hi | sha256sum -> 98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4
2 - 0155122098ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4
3 - becomes AFKREIEY5JXE6ILPF62LNH77TM5EJBBMHBUGZJUF6P2V3REMLU73CED34Q in base32
4 - cid is BAFKREIEY5JXE6ILPF62LNH77TM5EJBBMHBUGZJUF6P2V3REMLU73CED34Q (lowercase or uppercase is the same)

The result can be verified at the CID Inspector.

Programmatically it is, of course, also possible using the cid/multihash libraries.

ipfs get BAFKREIEY5JXE6ILPF62LNH77TM5EJBBMHBUGZJUF6P2V3REMLU73CED34Q should also work.
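The four steps can also be sketched in Python with just the standard library. A hedged illustration; the hard-coded 01551220 prefix is the same CIDv1/raw/sha2-256 header described above:

```python
import base64
import hashlib

def raw_sha256_cid(data: bytes) -> str:
    """Build a CIDv1 for a single raw block: version (0x01), raw codec
    (0x55), sha2-256 multihash code (0x12), digest length (0x20)."""
    digest = hashlib.sha256(data).digest()
    cid_bytes = bytes.fromhex("01551220") + digest
    # multibase base32 is RFC 4648 lowercase without padding, prefixed "b"
    b32 = base64.b32encode(cid_bytes).decode().rstrip("=").lower()
    return "b" + b32

print(raw_sha256_cid(b"hi\n"))
# bafkreiey5jxe6ilpf62lnh77tm5ejbbmhbugzjuf6p2v3remlu73ced34q
```

That matches the CID above (uppercase and lowercase base32 decode to the same bytes).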


While your example works, I cannot get it working on an actual file on IPFS.
Take this file as an example: bafybeibsxwamwl3uc52ua5565cgol3xmo3cf5oswpnsammmxmksiytxlda (just a 2 KB CSS file).

  • sha256 output of that file: 6d8cf75be944a00e478c4d1fe7d4a28bee572177d2ba22b22c8bf2664a0633c7
  • With the prefix it becomes: 015512206d8cf75be944a00e478c4d1fe7d4a28bee572177d2ba22b22c8bf2664a0633c7
  • In base32: AFKREIDNRT3VX2KEUAHEPDCND7T5JIUL5ZLSC56SXIRLELEL6JTEUBRTY4
  • Prefixed with B it becomes: BAFKREIDNRT3VX2KEUAHEPDCND7T5JIUL5ZLSC56SXIRLELEL6JTEUBRTY4

The CID Inspector tells me that the original one has the dag-pb multicodec, whereas this composed one has the raw multicodec.

A dag-pb CID with sha256 would probably give a working CID; any tip on how to do that? Yet you still have to rely on the world using that default CID, otherwise you will still not find your content (e.g. I cannot find that CSS file with the raw sha256 multihash prefix).

But then again, how likely are we to ever get into that situation?
It would be nice to be able to compose a CID that obeys the defaults. That in particular would be neat for sites that advertise the sha256 checksums of files (mostly happening in the Linux world, for live ISOs/releases I think), to let people download them via IPFS.

Lastly, will this trick work for files bigger than the block size? I somehow doubt it… But I’ve been wrong more than once in this thread, so I’m wary of claiming anything now :wink:

Remember, an actual file (like in MFS) is not the same kind of thing as a DAG entry. Here’s the table of multicodec values:

Maybe that’s enough info to help ya, but if not, Hector can fill in the blanks.

Unless you add it with --raw-leaves, ipfs will wrap even small files in dag-pb nodes. By default, ipfs chunks files over 256 KB, but you can set the chunker to 1 or 2 MB; libp2p refuses to move blocks larger than that, even though you can produce them.

Ah, I thought so. Glad a suspicion is finally correct :slight_smile:
But that does effectively mean that this “create a CID if you know the digest and fetch it that way” idea doesn’t work on IPFS as is.

If one got it to work, it would be in a private network of IPFS nodes, which kind of defeats the purpose.
Is it safe to assume that IPFS isn’t going to change this? Or in other terms, is it safe to assume that if one wanted a mapping of n digest algorithms -> CID, it would best be done as a “metadata” layer on top of IPFS?

I’m asking because I’m curious whether that would be the right approach. I personally have no use for it other than the already super edge cases mentioned in this thread. I’d put this quite low on a nice-to-have feature list, to be honest :wink: (but on it nevertheless!)

@wclayf That partially helps. I’m assuming (again) that the dag-pb type has a few extra bytes besides just the codec code.

I wonder if BitTorrent, BTFS, WebTorrent, etc., or something else can provide a way to download a file purely from its SHA-256, or if this is just such a difficult problem that mankind hasn’t solved it yet. :slight_smile: