Should we profile CIDs?

CIDs are foundational for IPFS. However, I frequently hear from people complaining about them. Common issues I hear are:

  • It can be finicky to produce the same CID in two separate implementations.
  • Can’t verify unless you have all the hashes.
  • Too much optionality to bother implementing, it’s too big a step up from just using hashes.

This leads me to believe that there is potential to help people and increase adoption with profiling. I would be curious to hear what people here think, however.

What are your CID pain points, if any?


By profiling, do you mean:
1.) IPIPs recommending specific codecs and architectures/configs for specific user-stories or use-cases?
2.) "these were the defaults on X software from year A to year B, we call this profile softwareA-B and you can generate those CIDs with command ipfs add --software-a-b"?
3.) end-to-end profiles that can be used to build lean/low-resource versions of key components, like “private-network only, JS/TS-only E2E starter kit” type stuff?
4.) some other thing?

I mean whatever makes this easier for people to use and adopt. Any of these are acceptable so long as they address the problems that people are facing.

1 is already possible-- nothing’s stopping us from doing it for the web-verifiability use cases, and I’ve been very slowly contributing normrefs to that. we could probably bump this up the priorities list if people are asking for it.
2. there is some amount of this already done, a good IPIP structuring it out a bit more to expand coverage to more of the sensible defaults, and to simplify parity across implementations, would probably be appreciated?
3. this is a huge lift and prolly requires grants or earmarking of substantial amounts of time. i’m supportive but it might take a village.
I don’t think anyone has any particular incentive to block or poopoo any of the three, it’s more a matter of which would be the most helpful soonest (and whomst wants to donate to the grant-paid folks doing the work, and how big would said grant have to be). Any opinions useful, in the meantime.

I think that this is something that would help keep the best of both worlds.

In fact, we already have something like this in Kubo since the 0.29 release. Profiles are additive (you can apply multiple) config sets for Kubo.

We added the test-cid-v1 profile, which changes the defaults to larger blocks and CIDv1.


Can you explain what you mean?


Dropping this here as a common example of a pain point faced by users


I have 2 big beefs with CIDs as they are right now.

  1. There are many good arguments for this, but I feel it’s still destructive to even have it. A CID in kubo and a CID in web3.storage don’t match for the same data. You can find out why (enabling raw leaves, for example). But in a self-describing format, which CIDs are supposed to be, this “freedom” should not exist. Data X should give one CID according to the profile. Ignore the DAG part that web3.storage uses for a moment; the point is that there is flexibility within the CID profile resulting in different hashes, and that flexibility should not exist (or should be part of the CID itself).
  2. Can’t (easily) compare hashes. Specifically the technically inclined people first getting familiar with IPFS and CID are dumbfounded when they discover that sha256(big data) is not the same as the hash in their CID. It’s a concept that easily triggers when you explain that the CID is a tree and the one you see is the root of that tree. Still, not having the ability to just check is a limitation we have in IPFS-land.
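To make point 2 concrete, here is a minimal sketch of why the two hashes differ. It uses a deliberately simplified two-level tree (hash each chunk, then hash the concatenated chunk digests), which is an assumption for illustration and not the real UnixFS/dag-pb layout. The function names are made up.

```python
import hashlib

def flat_hash(data: bytes) -> str:
    """Plain sha256 over the whole input, as a tool like sha256sum reports."""
    return hashlib.sha256(data).hexdigest()

def toy_root_hash(data: bytes, chunk_size: int) -> str:
    """Hash each chunk, then hash the concatenated chunk digests.

    Toy tree only: real importers wrap chunks in dag-pb nodes, so the
    actual root hash differs from this one as well.
    """
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)] or [b""]
    leaf_digests = [hashlib.sha256(chunk).digest() for chunk in chunks]
    return hashlib.sha256(b"".join(leaf_digests)).hexdigest()

data = b"x" * 1_000_000
print("sha256(data):       ", flat_hash(data))
print("root, 256 KiB chunks:", toy_root_hash(data, 262_144))
print("root, 1 MiB chunks:  ", toy_root_hash(data, 1_048_576))
```

All three values differ for the same bytes, which is both the "sha256(big data) is not my CID" surprise and a small-scale version of point 1.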

Point 1 can be solved by having a defined standard. I thought CIDv0 and CIDv1 were those but apparently they still allow point 1 to exist, do they not? Whatever a format defines, point 1 must not exist. It won’t necessarily ease adoption but it will make interoperability much better, set good expectations and allow you to test implementations against each other.

Point 2 is a mindset issue. It will always persist as long as you can freely select the hashing algorithm (which should stay!). You can inflate the CID size (and call it CIDvCompat) where you’d define a standard that embeds 2 CIDs in 1 blob. It would allow you to have a CID as IPFS understands it and a CID that would translate to essentially sha256(data). In base32 your CIDvCompat would become something like 180 characters long. Ouch! And this only “informs” the user of what it represents. Nefarious nodes can still lie and verification is still up to the user.

Just having a tool, let’s say cidsum, would already go a long way! However, that tool isn’t gonna be easy to build as long as point 1 exists, because a hypothetical cidsum(data) == cid can still be false even when the same profile is used.
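For the simplest possible case, a cidsum can be sketched in a few lines. This assumes the data is treated as ONE raw block with sha2-256 (CIDv1, raw codec), so it should only line up with a real importer when the input fits in a single block and raw leaves are in use. The name cidsum_raw is hypothetical.

```python
import base64
import hashlib

def cidsum_raw(data: bytes) -> str:
    """CIDv1 string (raw codec, sha2-256) for data treated as a single block."""
    digest = hashlib.sha256(data).digest()
    # 0x01 = CID version 1, 0x55 = raw codec,
    # 0x12 = sha2-256 multihash code, 0x20 = digest length (32 bytes)
    cid_bytes = bytes([0x01, 0x55, 0x12, 0x20]) + digest
    # Multibase base32: "b" prefix, lowercase alphabet, no padding
    return "b" + base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")

print(cidsum_raw(b"hello world"))
```

Anything larger than one block needs the chunker and DAG layout settings as extra inputs, which is exactly where point 1 bites.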

Small nuance there. Irrelevant if you use kubo commands (ipfs get CID), it self-verifies.

But if you don’t use kubo, like getting it from a gateway, then you don’t know if the CID you downloaded matches the data you expect the CID to represent.
Each CID leaf represents a block of data (that’s the blocksize). The cid of that block tells you the hash it should have. You have to verify if the two match. Rinse and repeat for all leaf nodes. You end up downloading all the data and verifying all the blocks. Once that verification is done you know that the CID you initially requested (and the data you downloaded) matches.
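The per-block check described above can be sketched like this. It is a simplified illustration that assumes base32 CIDv1 strings, sha2-256 only, and single-byte varints in the CID header (true for raw and dag-pb). It also assumes you already have every block in hand; walking the DAG to find the blocks is left out. All function names are made up.

```python
import base64
import hashlib

def raw_cid(data: bytes) -> str:
    """Helper for the demo: CIDv1 / raw / sha2-256 for one block."""
    cid_bytes = bytes([0x01, 0x55, 0x12, 0x20]) + hashlib.sha256(data).digest()
    return "b" + base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")

def decode_cidv1_base32(cid: str) -> bytes:
    """Turn a 'b'-prefixed base32 CID string back into its binary form."""
    if not cid.startswith("b"):
        raise ValueError("only base32 ('b' multibase prefix) CIDs are handled here")
    body = cid[1:].upper()
    body += "=" * (-len(body) % 8)  # restore the padding multibase strips
    return base64.b32decode(body)

def verify_block(cid: str, block: bytes) -> bool:
    """Check that the block hashes to the digest embedded in its CID."""
    raw = decode_cidv1_base32(cid)
    version, _codec, hash_code, length = raw[0], raw[1], raw[2], raw[3]
    if version != 1 or hash_code != 0x12 or length != 32:
        raise ValueError("unsupported CID: expected CIDv1 with sha2-256")
    return hashlib.sha256(block).digest() == raw[4:4 + length]

def verify_all(blocks: dict[str, bytes]) -> bool:
    """Rinse and repeat: every block must match its own CID."""
    return all(verify_block(cid, data) for cid, data in blocks.items())

blocks = {raw_cid(b"chunk one"): b"chunk one", raw_cid(b"chunk two"): b"chunk two"}
print(verify_all(blocks))
```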


@danieln IIRC you also mentioned some challenges with memory consumption when merkle-izing large chunks of data. Can you share more about that?

One thing I would throw out is that there are still low-hanging tooling things that could make CIDs easier to work with.

If i have a cbor decode of data with a cid, I get a 42 tag, and then some hex, typically. it would be great if something like cid.ipfs.tech could more easily massage that back into a sane cid, versus the current workflow of having to write a script each time i encounter a binary cid in the wild.

I think this is a reasonable example because with any sort of hash you’re going to find non-canonical representations, and it’s about the quality of the tooling in the ecosystem for how frustrating it is when that happens.


I was referring to storage requirements that are more than double the dataset when you add to an ipfs node or convert to a CAR file.

The reason is that we store the bytes from the source file in the leaf blocks (essentially the binary chunks of the file/data) of the DAG.

Not sure I understood. Can you make the example more concrete?

Every message in filecoin for instance is displayed in block explorers as a hex encoding of the cbor. You can use https://cbor.me/ or similar to get the structure of the messages, but cids are then in their cbor encoded format, which is a tag of 42 and then a binary string that will typically be represented in hex. it’ll look like

42(h'000181E2039220200C7C257C5EA6BBA17053362...')

None of the ipfs tooling allows me to take a hex-of-binary representation of a cid today and get back the actual cid that i can use for the next step of following what message or other piece of data is being referred to.
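For reference, the conversion being asked for is small. The sketch below assumes the hex string is the full byte string found inside the tag 42, that it starts with the 0x00 identity multibase byte, and that a base32 CIDv1 string is the desired output. The function name is hypothetical, and the demo payload is built locally from a sha2-256 digest because the value quoted above is truncated.

```python
import base64
import hashlib

def tag42_hex_to_cid(hex_payload: str) -> str:
    """Convert the hex of a dag-cbor tag-42 byte string into a base32 CID."""
    payload = bytes.fromhex(hex_payload)
    if not payload or payload[0] != 0x00:
        raise ValueError("expected a leading 0x00 (identity multibase) byte")
    cid_bytes = payload[1:]  # drop the multibase byte, keep the binary CID
    return "b" + base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")

# Demo input: 0x00 prefix + CIDv1 / raw / sha2-256 header + digest
demo = bytes([0x00, 0x01, 0x55, 0x12, 0x20]) + hashlib.sha256(b"hello").digest()
print(tag42_hex_to_cid(demo.hex().upper()))
```

This does no validation of the CID itself; it only re-encodes the bytes.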


There are two separate problems, one of which has nothing to do with CIDs:

The problem here is establishing hash equivalency, sometimes across different systems. CIDs get a lot of flak here because it seems like they promise portability, but they fall short. The reason they fall short is they don’t include all the possible configuration required to reproduce the resulting hash if you had the raw input data. If CIDs did that, they would be even longer, but you’d be able to compare the encoded CID with another to detect configuration mismatches.

Evidence of this is Alan chiming in on a discussion above with all the configuration details that web3.storage uses.

This is the second problem, and one that can be tackled at the library level and in communications about what CIDs do. @willscott makes the great suggestion that you can do all sorts of things at the library level for parsing CIDs in ways that are specific to a given system. It’s not required that you drag around all the hash functions that CIDs can point to. Instead CIDs make it possible to implement only the ones you need, and spit out nice errors for the ones you don’t support.

It would be wise to adjust communication to point out that CIDs spit out nice strings that give you a slightly better starting point for working across different content addressed systems, and the right way to go about using that power is to parse only for the subset of possible CIDs that your system wants to work with. It’s only “slightly better” because CIDs don’t encode hashing configuration, but better than nothing in that you get to pack all sorts of useful info into that string on the encoding side, and can do smarter things on the decoding side.
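As a rough sketch of "parse only the subset you support", the parser below reads the varint header of a binary CIDv1 and rejects anything outside a small allowlist with a readable error. The allowlist contents, the exception name and the function names are assumptions for illustration; the codes 0x55 (raw), 0x70 (dag-pb) and 0x12 (sha2-256) are the commonly used multicodec values.

```python
SUPPORTED_CODECS = {0x55: "raw", 0x70: "dag-pb"}
SUPPORTED_HASHES = {0x12: "sha2-256"}

class UnsupportedCID(ValueError):
    """Raised when a CID is well-formed but outside what this system handles."""

def read_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Decode one unsigned varint (LEB128); return (value, next_position)."""
    value, shift = 0, 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return value, pos
        shift += 7

def parse_cid_header(cid_bytes: bytes) -> tuple[str, str, bytes]:
    """Return (codec_name, hash_name, digest) or raise a descriptive error."""
    version, pos = read_varint(cid_bytes, 0)
    if version != 1:
        raise UnsupportedCID(f"CID version {version} is not supported here")
    codec, pos = read_varint(cid_bytes, pos)
    if codec not in SUPPORTED_CODECS:
        raise UnsupportedCID(f"codec {hex(codec)} is not supported here")
    hash_code, pos = read_varint(cid_bytes, pos)
    if hash_code not in SUPPORTED_HASHES:
        raise UnsupportedCID(f"hash function {hex(hash_code)} is not supported here")
    length, pos = read_varint(cid_bytes, pos)
    digest = cid_bytes[pos:pos + length]
    if len(digest) != length:
        raise ValueError("digest is shorter than its declared length")
    return SUPPORTED_CODECS[codec], SUPPORTED_HASHES[hash_code], digest

print(parse_cid_header(bytes([0x01, 0x55, 0x12, 0x20]) + bytes(32))[:2])
```

Fed the header bytes from the Filecoin example earlier in the thread (01 81E203 9220 20 ...), this parser stops at the codec with "codec 0xf101 is not supported here", which is the kind of error a system that only speaks raw and dag-pb would want to surface.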


That’s not really true, see Go Playground - The Go Programming Language.

What you seem to be encountering is the result of the creators of dag-cbor choosing to encode CIDs not in their binary form, but in their string form but with an identity multibase. Most people’s use of tooling is around the binary form, or a human-readable string form so the use of the identity multibase can be pretty confusing. In the hex case if you replace the first 00 with an F most other tooling works fine as well.
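The prefix swap mentioned above is a one-liner. This sketch assumes the input is the hex of the tag-42 byte string and emits a lowercase base16 multibase string ("f" prefix); an uppercase "F" with uppercase hex is the base16upper variant. The function name is made up.

```python
def identity_hex_to_base16(hex_payload: str) -> str:
    """Swap the leading identity-multibase byte (00) for the base16 prefix 'f'."""
    if not hex_payload.startswith("00"):
        raise ValueError("expected the hex to start with 00 (identity multibase)")
    return "f" + hex_payload[2:].lower()

# The remaining characters are the binary CID in hex, unchanged.
s = identity_hex_to_base16("000155ABCD")
print(s)
```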

I’ve yet to find someone using the identity multibase in a way I found natural, but perhaps that’s just me.

I also wrote a similar 3 lines, and do so each time i trace a filecoin transaction. that it’s a “common” way that cids are surfacing and i need to write 3 lines in the golang playground each time i encounter it is something i consider a failing of the tooling. I’ve looked at the javascript side of the interface that the cid inspector is written in and it seemed sufficiently non-intuitive that i haven’t yet become sufficiently annoyed to submit a patch to allow decoding this form there.

but that i can’t get to a canonical form from a search is a gap in tooling for our custom format, compared to other base encodings where there are plenty of online tools.


I hear from users and churned potential users who just want to reliably generate deterministic CIDs and run into 2 main issues:

  • Preprocessing: Data, especially media, often gets preprocessed before it gets turned into a CID. If preprocessing is different, so is the resulting CID. Standard profiles for preprocessing might be very useful.
  • There is no small, clean library to generate CIDs. multiformats almost does this, if you’re willing to dive into the weeds of chunking and building blocks. We often point people to kubo, but that’s a lot of software to get a CID. A library that only generates CIDs in a single call (e.g. cid($FILEPATH)) and offers some basic transcoding tools would be much easier to choose, vet, and contribute to.

Some paraphrased user quotes:

  • “People want the smallest amount of code that turns a directory on a disk into a .car file.”
  • “I just want to make a cid in Javascript and a cid in Rust and have them match.”
  1. smallest, cleanest CID generator to date is probably w3up, right?
  2. in THEORY if we do these profiles things right, JS and Rust and Go SHOULD be able to always produce the same CID for a given input GIVEN ALL THREE were configured to the same profile

In my personal experience, most of the time “reproducible CIDs” are mentioned, users mean CIDs of UnixFS data, which was already chunked in an opinionated way by Kubo (go-ipfs).

The default Kubo (go-ipfs) ipfs add parameters remained the same for nearly a decade, which made people assume there is a “set in stone standard” and that importing the same data will always produce the same CID, no matter what software is used (when the only guarantee is that the same CID always produces the same data).

Switching Kubo’s ipfs add defaults (kubo#4143) will have a big educational impact, nudging people to learn that the same UnixFS CIDs are produced only when the same settings are used, and to set an explicit “cid profile” if their use case depends on that.

FYSA we’ve made some progress towards allowing users to customize “profile of settings” that impact produced CID when files and directories are turned into UnixFS DAG.

Kubo v0.29 introduced:

Kubo 1.x release (ETA TBD, hopefully sooner rather than later) will switch Kubo’s ipfs add “CID profile” from legacy CIDv0 to new CIDv1 (finally closing kubo#4143).

There is also the aspect of content-type-aware chunkers (videos, images, archives) for UnixFS, which I won’t go into here (see example: WARC file chunking), but which we should be aware of, because content-type-aware chunkers will grow the number of possible “cid profiles”.


Q: Is there any actionable thing we could do today to make “profiles” a thing?

For the purpose of this discussion, test-cid-v1 and legacy-cid-v0 presets from Kubo could be used as examples of “cid profiles” @robin hinted at, but those settings are hard to discover if someone is new to the ecosystem.
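One way to picture a “cid profile” is as a small, named record of the settings that influence the resulting CID, which two implementations can then diff. The sketch below is hypothetical: the field names are made up, and the values attached to the two Kubo preset names are written from memory of the Kubo docs, so treat them as assumptions and check the docs before relying on them.

```python
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class CidProfile:
    cid_version: int
    hash_function: str
    raw_leaves: bool
    chunker: str

# Assumed values, not authoritative; verify against the Kubo documentation.
LEGACY_CID_V0 = CidProfile(0, "sha2-256", False, "size-262144")
TEST_CID_V1 = CidProfile(1, "sha2-256", True, "size-1048576")

def profile_diff(a: CidProfile, b: CidProfile) -> dict:
    """Return {setting: (value_in_a, value_in_b)} for every setting that differs."""
    da, db = asdict(a), asdict(b)
    return {key: (da[key], db[key]) for key in da if da[key] != db[key]}

print(profile_diff(LEGACY_CID_V0, TEST_CID_V1))
```

An empty diff does not prove two implementations will agree (DAG layout, directory sharding and preprocessing are not modelled here), but a non-empty one explains a mismatch before anyone starts comparing hashes.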

Would it be useful to have a “CID Conventions” section at https://specs.ipfs.tech/ as a way of disseminating information about involved settings to implementers that care about 1:1 reproducible CIDs? We seem to have enough “profiles” in the wild to make it worthwhile: “Kubo CIDs”, “Iroh CIDs”, “LUCIDs” etc.


Idea: why don’t we abuse CID codecs to store chunking information? For example unixfs gets multiple codecs for common, sane configs: i.e. unixfs-1MB-raw-leaves-hamt-folders, unixfs-256kb etc…

In practice CIDs use what, 3 codecs? Cbor, unixfs/dagpb/json…

More pain than gain?

Abusing and adding new codecs to UnixFS spec feels like a breaking change because it introduces interop problems every time a new profile codec is added:

Reading DAGs:

  • Old implementations will not be able to access UnixFS DAGs that use codecs other than raw and dag-pb. Users will see “unknown codec error”. Bad for end users.
  • New implementations that want to be future-proof now have to attempt to parse unknown codecs as dag-pb. Bad for implementers and wider ecosystem, because UnixFS becomes the “default and mandatory fallback”.

Creating DAGs:

  • Who will be brave enough to start producing CIDs with these new codecs to risk interop problems on reading path across ecosystem?
    • My guess: it ends up being a dead spec, the UnixFS defaults in apps and services that produce CIDs likely will remain raw + dag-pb “until interop is solved” which never happens due to never ending papercuts when custom codecs are used.

And there is always the governance problem, over time we end up with many dead codes.


I’m not sure what you mean by this, maybe say a little more? I think if you mean “optimistically assume” chunking profile/architecture/source by codec when fetching, maybe? if you’ve got a bitswap-incompatible codec maybe skip the DHT check and go straight to IPNI, for example…

I like this part-- there should probably be multiple different UnixFS profiles and at least one non-UnixFS/“whole file” profile, for starters. Not sure how many profiles are really needed urgently but I assume it’s a few more than 3?

Indeed, if we apply to IETF and get WG status we have to choose how many multicodec entries get classified as “FINAL” at the genesis of the IANA registry. I am of the opinion that the only rows that should have this “FINAL” concrete poured around them are the ones that have end-to-end profiles written around them, and everything else be “contingent” or “vendor”-level reservations… this keeps it possible to have relatively lightweight implementations that only implement, say, 1 or 2 profiles, and a basic implementation that only implements “FINAL” codecs.