Should we profile CIDs?

I mean, who is brave enough to use CIDv1s? Those exact reasons apply to CIDv1s too, and implementations are always going to have problems with new codecs etc.

Speaking generally, the v1 release lets us set new defaults, and we should use the opportunity to address user feedback now rather than later, so we may as well change the default encoding (even to JSON, because parsing protobufs and CBOR is the least friendly thing ever). dag-pb itself is a big umbrella covering file nodes, directory nodes, HAMT nodes etc., and CIDs give no hint about which one you are looking at. So, what information would be worth embedding in a CID(v2), if we could choose? You could use codecs as profiles, or look for other ways.

Specifically about the chunking params, if not in the CID, this info could also travel as metadata in the root dag-pb node just like there is size metadata.

We could also do meta-cids consisting of an identity cid with metadata (profile) and a link to the actual root cid.
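
To make that concrete, here is a rough TypeScript sketch (using multiformats and @ipld/dag-cbor) of what such a meta-CID could look like; the profile field names are made up for illustration:

import { CID } from 'multiformats/cid'
import { identity } from 'multiformats/hashes/identity'
import * as dagCbor from '@ipld/dag-cbor'

// Wrap profile metadata plus a link to the real root in a dag-cbor document,
// then address it with the identity "hash" so the whole document is inlined
// into the CID itself (the profile field names below are hypothetical).
const root = CID.parse('bafybeiahi2rfez66oxcyeyrwniq7kktzeqf3bkffuyktbpwxlppw3k6b7i')

const metaBytes = dagCbor.encode({
  profile: { chunker: 'fixed-1mib', layout: 'balanced', width: 1024 },
  root
})

// the identity multihash embeds the bytes verbatim instead of hashing them
const metaCid = CID.createV1(dagCbor.code, identity.digest(metaBytes))

console.log(metaCid.toString())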

In terms of developer experience, understanding/implementing dag-pb itself in order to read a file is pretty hard, and CBOR is worse. Perhaps we need dumb, human-readable formats too.

2 Likes

Good question. Did we only care for raw blocks to work on the ipfs.io public gateway, plus the latest Kubo and IPFS Desktop supporting them? Or did we keep Kubo’s ipfs add at CIDv0 for a very, very long time partially to ensure everyone upgraded to a version that supports the raw codec in addition to dag-pb? I’m unsure what the answer is. But we know that today, the version distribution in the public swarm makes it relatively safe to switch to CIDv1 with raw leaves any time we want.

But if we add a new codec foo today, we will have at least six to twelve months where half of the public network is unable to read data created with it, so we should be extra sure the value added to the ecosystem is worth the pain.

If we leverage the extra Metadata node from existing UnixFS, or follow what was done in UnixFS 1.5 (extra optional fields in root blocks), we could stay backward-compatible (old clients would still be able to access files and dirs) and still solve many problems on the client side.

Off the top of my head (a rough sketch of such optional fields follows this list):

  • parameters used for creating the dag (persisting “profile” info)
  • total length of raw user data (improving basic UX in places like directory listings, which currently show only the total size that includes dag-pb envelopes)
  • hash(es) of raw user data (e.g. md5 / sha2 without chunking) to improve interop with non-IPFS systems that already have hashes
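
For illustration only, this is the kind of optional, backward-compatible root metadata I have in mind; the field names are hypothetical, not an existing spec:

// Hypothetical shape of extra optional root-block fields, in the spirit of
// the UnixFS 1.5 additions (mtime/mode); names are illustrative only.
interface RootExtras {
  importProfile?: string        // e.g. "v1-sha256-balanced-1mib-1024w-raw"
  rawDataLength?: bigint        // total length of raw user data, without dag-pb envelopes
  externalHashes?: {            // hashes of the raw data as non-IPFS systems know it
    sha256?: Uint8Array
    md5?: Uint8Array
  }
}
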
1 Like

The FDroid folks ended up building ipfs-cid for this purpose, and got it into Debian so there was a straightforward way of getting a CID from a binary. I haven’t checked but I’m guessing they used the Kubo defaults (or maybe I should say the Kubo-2022 profile?).

It’d be nice to include it as one baseline for this, as its simplicity and availability in Debian derivatives make it an easy way to introduce these ideas.

2 Likes

The problem here is establishing hash equivalency, sometimes across different systems. CIDs get a lot of flak here because it seems like they promise portability, but they fall short. The reason they fall short is that they don’t include all the configuration required to reproduce the resulting hash from the raw input data.

So my question here is: do we attempt a CIDv2 that packs all this information into the CID, as @hector suggests (probably we should), or should we also establish a mechanism for talking about hash equivalency?

Ultimately, I think we need a way to talk about hash equivalency. The underlying problem is that most people expect the same raw data to produce the same CID (I imagine by default they expect a SHA-256 of all the raw data taken as a whole), which is simply not the case and never will be. Encoding the UnixFS params in CIDv2 creates more CIDs rather than fewer. We will always have many different CIDs for the same data.

My suggestion is to introduce a level of trust through signed attestations of CID equivalency.

A data structure might look like this:

{
   "original": "SHA256 of raw",
   "unixFS": [
      {
         "CID": "root~cid",
         "chunkingParams": {
           // ....
         }
      },
      // ... you could have more here
   ],
   "blake3": "blake3~cid",
   "pieceCID": "filecoin~piece~cid",
   "attested_by": "some~pub~key~ideally~the~original~data~author",
   "signature": "signature-bytes"
}

I’m just spitballing a structure here – we actually use UCANs in web3 storage for some of this but I’m not super a fan since they’re actually just attestations. But hopefully the above illustrates exactly how many CIDs we might actually want to tie together – there are a bunch.

Of course, now you’re trusting whoever created this attestation until you fetch the data. But ultimately, you’re always trusting something before you fetch the data, modulo some incremental verifiability. And, depending on the data itself, there may be a higher level of trust in the person who signed this data than in fetching from a random peer. Personally, if I have such an attestation for a Linux ISO signed by the pub key of the group that produces it, I’m inclined to relax my incremental verifiability requirements at transport time (and still verify incrementally against a UnixFS tree afterwards, maybe).

Moreover, once you fetch the data, you might produce an additional attestation you sign, so now you have a bunch of people saying “these two are the same” and at some point you establish a decent level of trust.
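
To make that concrete, here is a minimal sketch of signing and verifying such an attestation with Ed25519 via node:crypto, using dag-cbor for a deterministic encoding. The field values are placeholders and this isn’t any existing API:

import { generateKeyPairSync, sign, verify } from 'node:crypto'
import * as dagCbor from '@ipld/dag-cbor'

// Sign the attestation body (everything except the signature field).
// dag-cbor gives a deterministic encoding, which is what you want to sign over.
const { publicKey, privateKey } = generateKeyPairSync('ed25519')

const body = {
  original: 'sha256-of-raw',   // placeholder values throughout
  unixFS: [{ CID: 'root-cid', chunkingParams: { chunker: 'fixed-1mib' } }],
  attested_by: publicKey.export({ type: 'spki', format: 'der' })
}

const payload = dagCbor.encode(body)
const signature = sign(null, payload, privateKey) // Ed25519 requires a null algorithm

// Anyone holding { body, signature } can check it before deciding how much to trust it.
console.log(verify(null, payload, publicKey, signature)) // true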

Anyway that’s my 2c :slight_smile:

2 Likes

Oh, nice. I’ve also just learned of python-libipld, a similarly minimal library to work with CIDs, CARs, and DAG-CBOR.

This has been a great discussion so far. Seems like there is strong (not unanimous) agreement on the current challenges, and general enthusiasm for hash equivalency and cleaner libraries/interfaces. I propose that we:

  1. Move profiles into IPIP process - Agree on a minimal set of profiles & names, plus a process for canonicalizing. This may include working with Kubo to update the name of test-cid-v1. We could have this discussion here, and move to a PR in the specs repo when there’s a more concrete proposal.
  2. Consider more minimal implementations - Start a new thread about leaner IPFS implementations with key functions: hash, hash --profile, verify, [others].
  3. Host some sessions at IPFS Bangkok in Nov to work on these in person, in parallel with async/online.
1 Like

I wouldn’t. IMO, this information should be file-level metadata (e.g., UnixFS metadata). This kind of metadata is very relevant to files/archiving, but much less relevant to, e.g., application/dapp data.

I also want to note that there are really two ways of using IPFS/IPLD.

  1. Import data from outside IPFS/IPLD. In this case, “profiles” make sense because you need to know how that data was imported.
  2. IPLD/IPFS Native: the files/data-structures were constructed as IPLD from the very beginning. In this case, there’s no obvious way to say, e.g., “this data was chunked with X” because the data may have evolved over time. Think MFS.

I want to be careful here because, ideally, we’d live more in world 2 than world 1 (obviously not the case today).

3 Likes

I find it challenging to understand the use case for having profile/algorithmic info in the CID or (even more of a stretch) metadata in a root node or a metadata node hanging off the root.

You have the original data, and you have a CID you want to match. But you don’t have info on how that CID was generated (otherwise you could replicate it by applying the same profile). You don’t want to fetch the DAG (because if you did, you could deduce whether it matches regardless of how it was chunked or what type of nodes were used etc.). But you are OK with either: large CIDs; fetching the root node; or fetching the root node and another node. And then your tool would come back with: yep, that’s the right CID, or: no, I came up with this other CID.

Do I have this right?

I’d like to underline something brought up by @stebalien - “the data may have evolved over time.” In a project I use non-professionally to incrementally update an IPFS version of a directory tree, when I change how a node is arranged (usually replacing a CBOR subtree with a link in order to fit within my preferred block size), I don’t touch any part of the tree that’s not a direct ancestor of the block that needed to change.

What if one day someone did something similar, but was smart about it: they used a chunker that prioritises the early bytes of a file for video, but something more standard for text, and the node types used for directories shift based on the size of the directory, and… do all of those thresholds and switches need to be encoded in the profile? And if so, is the profile now complicated enough that we don’t want it shoved into the CID? Perhaps if it’s a metadata node you could repeat the node in subtrees where the decision changes, but then the verifier still needs to fetch an arbitrarily large fraction of the DAG - why not get all of it? Are the trade-offs really worth it?

It might make sense to think of CAR files as specific to the former set of use cases and alien to the latter, right? I’m not sure a CAR file has to be the referent of every CID, but it’s also a sensible default for many use cases (and as such worth refining/iterating/hardening). I love this framing, though, because we have to balance 1 against 2. If only 1 mattered, mandating maximally verbose profiles at the “top level” or the ingress/egress points of file-based systems would make perfect sense, while I am partial to not breaking any userspace in IPLD land. The tricky bit is how much flexibility to retain in terms of laying tracks in front of the IPLD-native train… it is possible to err too far on the side of the #2 use cases as well.

Maybe the trick here would be to have profiles clearly defined at time of CID generation/encoding/ingress, but not strictly coupled to CID[v2] OR to CAR files? Profiles are worth elaborating in more detail anyways, is my intuition.

In @alanshaw’s (of Storacha) four-year-old library “ipfs-hash-only” (since archived and moved into the /ipfs-inactive/ GitHub org), there were many, many options exposed for how to generate CIDs; compare this to the four properties tweaked by Kubo’s preconfigs! I remember discussing this with Alan in Istanbul: not everything this API allows one to manually tweak needs to be manually tweakable, but perhaps it’s worth breaking each profile out into a chunking strategy, a DAG-building strategy, etc.? (And of course, having “null” be an option for both!) A rough sketch of what that decomposition could look like follows below.
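
Purely illustrative TypeScript, not an existing API; the names and variants are made up to show the shape of the idea:

// Hypothetical decomposition of a profile into orthogonal strategies.
type ChunkingStrategy =
  | { kind: 'fixed', size: number }           // e.g. 1 MiB
  | { kind: 'rabin', min: number, avg: number, max: number }
  | { kind: 'null' }                          // data already arrives as blocks

type DagBuildingStrategy =
  | { kind: 'balanced', maxChildren: number } // e.g. 174 or 1024
  | { kind: 'trickle', maxChildren: number }
  | { kind: 'null' }                          // single block, no DAG

interface Profile {
  name: string                // e.g. "v1-sha256-balanced-1mib-1024w-raw"
  cidVersion: 0 | 1
  hash: 'sha2-256' | 'blake3'
  rawLeaves: boolean
  chunking: ChunkingStrategy
  dagBuilding: DagBuildingStrategy
}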

Yes, but the convo is about how it is not possible to readily reproduce a CID from “a piece of data” without more information that is currently stored nowhere (and could be stored in the CID).

I agree. One problem is how the IPLD encoding (protobuf) and the file format (UnixFS) are coalesced into a single IPLD codec in this case (unlike CBOR etc.). So file-level metadata could go into the IPLD codec, if we understand codecs to mean file formats and not just IPLD data-blob encodings. Right?

1 Like

I would also add that one recurring theme in these discussions is that CIDs that point to a DAG of IPLD and CIDs that point to a logical unit of UnixFS files/directories are fairly distinct use cases, and mixing up the codecs leads to a lot of confusion and implicit assumptions. Perhaps it’s a good idea to keep a cleanup of multicodec in mind as one possible step in defining more robust profiles. For instance, registering a new codec for UnixFS with a few invariants or config variables hardcoded, or a new codec for some more verbose/explicit UnixFS envelope/metadata file, whether that be CARvNext or something different altogether, would go a long way towards making a profile for more foolproof UnixFS usage…

We’ve put up a simple profile at dasl.ing.

The goal is to align on as small a set of primitives as possible that are as simple as possible so that using content-addressing is a thing you spend maybe an hour figuring out once, and then just use without having to think about it. It’s kind of a gateway drug to IPFS.

This is new and subject to change — feedback very welcome!

If you want to join a friendly informal talk about it, there’s an upcoming event: lu.ma/jo7wbgqz

See the initial discussion on Bluesky.

2 Likes

FWIW, I updated the DAG builder visualiser defaults to use the same parameters as the test-cid-v1 profile in Kubo.

Specifically: sha2-256, 1 MiB chunks, raw leaves, and 174 max children.

https://dag.ipfs.tech/

5 Likes

Here’s an update on DASL and other topics discussed here. DASL Update for Q1

1 Like

Also, DASL is an incomplete solution to the problems described above because it doesn’t address chunking and DAG width for larger files.

So we are still pursuing profiling to get predictable, comparable CIDs.

Here is a summary table of current defaults, thanks to input & clarifications from @danieln @achingbrain @lidel:

|             | Helia default | Kubo default | Storacha default | “test-cid-v1” profile | DASL |
|-------------|---------------|--------------|------------------|-----------------------|------|
| CID version | CIDv1         | CIDv1        | CIDv1            | CIDv1                 | CIDv1 |
| Hash Algo   | sha-256       | sha-256      | sha-256          | sha-256               | sha-256 |
| Chunk size  | 1MiB          | 256KiB       | 1MiB             | 1MiB                  | not specified |
| DAG width   | 1024          | 174 (but it’s complicated*) | 1024 | 174             | not specified |
| DAG layout  | balanced      | balanced     | balanced         | balanced              | not specified |
  • Kubo has 2 different default DAG widths:
    • For HAMT-sharded directories, the DefaultShardWidth here is 256.
    • For files, DefaultLinksPerBlock here is ~174

Kubo currently has no CLI / RPC / config option to control DAG width. ipfs add: parameter to control the width of created UnixFS DAG · Issue #10751 · ipfs/kubo · GitHub is the starting point for adding that ability.

Next steps:

  1. Discuss whether 1024 or 174 width is preferred, or if it’s worth having both.
  2. Come up with a better naming system for these profiles (test-cid-v1 isn’t quite right: it covers a lot more than the CID version, the “test” part doesn’t instill confidence, and it doesn’t really work as a series because it’s not clear whether v1 refers to the CID version or a profile version)
  3. Identify someone to land ipfs add: parameter to control the width of created UnixFS DAG · Issue #10751 · ipfs/kubo · GitHub
3 Likes

Adding some additional context as I’ve been diving into this

The state of UnixFS in JavaScript/TypeScript

Most developers use higher-level libraries that depend on these UnixFS implementations, and they have slightly different defaults.

Defaults and naming profiles

Naming things is hard. Moreover, understanding the trade-offs in different UnixFS options is far from obvious to newcomers sold on content addressing.

In thinking about this for users not familiar with internals, the conclusion I came to is that we should lean more heavily on defaults to guide users to the happy path, which ensures CID equivalency given the same inputs.

As for naming, my initial suggestion was to name the profile unixfs-v1-2025, denoting the year it was ratified. This is grounded in the insight that consensus around conventions can change over time, though not that often. However, I realise the shortcomings of this approach: it carries no information about the specifics of the profile, so the actual parameters will still need to live in the spec, and with time the name might feel “outdated”.

I should also note that I don’t think packing this information into a CIDv2 is pragmatic. It would be a breaking change that I don’t think the ecosystem would embrace, leading to more fragmentation and confusion.

Another approach could be to name profiles based on the key UnixFS/CID parameters:

  • CID version
  • hash function
  • layout, e.g. balanced, trickle
  • chunk-size
  • dag width
  • raw blocks
  • HAMT threshold (I’d need to dive deeper into whether there’s that much variance around this)

For example v1-sha256-balanced-1mib-1024w-raw.

Long and convoluted, but encapsulates the information.
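
As a sketch (hypothetical names, TypeScript), deriving the name mechanically from the parameters would at least keep the two in sync:

// Sketch: derive a descriptive profile name from its parameters
// (field names and the "dag-pb-leaves" fallback are hypothetical).
interface ProfileParams {
  cidVersion: 1
  hash: string        // e.g. 'sha256'
  layout: 'balanced' | 'trickle'
  chunkSize: string   // e.g. '1mib'
  dagWidth: number
  rawLeaves: boolean
}

function profileName(p: ProfileParams): string {
  return [
    `v${p.cidVersion}`,
    p.hash,
    p.layout,
    p.chunkSize,
    `${p.dagWidth}w`,
    p.rawLeaves ? 'raw' : 'dag-pb-leaves'
  ].join('-')
}

// profileName({ cidVersion: 1, hash: 'sha256', layout: 'balanced',
//               chunkSize: '1mib', dagWidth: 1024, rawLeaves: true })
// => 'v1-sha256-balanced-1mib-1024w-raw'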

HAMTs and auto-sharding

HAMTs are used to shard UnixFS directory blocks that contain so many links that the block would exceed a certain size.

Almost all implementations use a HAMT fanout of 256. This refers to the number of “sub-shards”, or the “ShardWidth”.

Implementations vary in how they determine whether to use a HAMT. Some support auto-sharding, where they automatically shard based on an estimate of the block size (counting the size of PBNode.Links); a rough sketch of this kind of estimate follows the list below.

  • Kubo/Boxo uses a size-based parameter (HAMTShardingSize) of 256KiB, where 256KiB is an estimate of the block size based on the size of all links/names. An estimate is used (rather than the actual block size) to avoid needing to serialise the protobuf just to measure size while constructing the DAG.
    • go-unixfsnode (used by go-car and extensively by filecoin ecosystem) also autoshards like Boxo/Kubo
  • Helia and ipfs/js-ipfs-unixfs use the same approach as Kubo (discussion, and this comment). The config is shardSplitThresholdBytes which defaults to 256KiB
  • ipld/js-unixfs, which the Storacha tools ipfs-car and w3up depend on, doesn’t implement auto-sharding (open issue). Consumers of the library like ipfs-car and w3up trigger HAMT sharding once a directory has 1000 links.
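
Here’s a rough TypeScript sketch of that kind of estimate-and-threshold decision. The constant matches the 256KiB default mentioned above, but the estimate itself is a simplification, not the exact Boxo formula:

// Estimate the size of a plain dag-pb directory block from its links and
// switch to a HAMT once the estimate crosses a threshold (approximation only).
const HAMT_SHARDING_SIZE = 256 * 1024 // 256 KiB threshold

interface DirLink {
  name: string
  cidLength: number // byte length of the child CID
}

function estimateDirBlockSize(links: DirLink[]): number {
  // each link costs roughly: name bytes + CID bytes (protobuf framing ignored)
  return links.reduce((sum, l) => sum + l.name.length + l.cidLength, 0)
}

function shouldUseHamt(links: DirLink[]): boolean {
  return estimateDirBlockSize(links) > HAMT_SHARDING_SIZE
}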

Trade-offs in DAG Widths

@mosh brought up the question of whether 1024 or 174 width is preferred.

  1. Discuss whether 1024 or 174 width is preferred, or if it’s worth having both.

As far as I understand:

  • Wide DAGs are better for retrieval speed: more blocks can be downloaded in parallel because you know about more of the DAG without having to traverse to lower layers. In other words, wider DAGs result in fewer hops from the root of the DAG (just metadata) to the leaf nodes (containing the actual data); there’s a quick depth calculation after the table below.
  • Wide DAGs, on the other hand, are worse for structural sharing (which enables caching) when updated (as in Git): changes touch fewer nodes, but each update comes at a higher computational (more data to hash) and I/O (more blocks to read) cost of rehashing.

| Factor             | Wide DAG                                   | Narrow DAG                                                      |
|--------------------|--------------------------------------------|-----------------------------------------------------------------|
| Tree depth         | Shallow                                    | Deep                                                             |
| Verification speed | Faster (parallelizable) but more CPU-heavy | Slower (sequential)                                              |
| Node size          | Larger (more links)                        | Smaller                                                          |
| Update cost        | Higher (more children, i.e. bytes to hash) | Lower cost, but more steps since updates cascade up many levels  |
| Parallelism        | High                                       | Low                                                              |
| Deduplication      | Low (bad)                                  | High (good)                                                      |
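
As a back-of-the-envelope illustration of the depth difference (TypeScript, simple arithmetic only):

// How many levels of intermediate nodes does a balanced DAG need
// for a given file size, chunk size and max children per node?
function dagDepth(fileSize: number, chunkSize: number, width: number): number {
  let nodes = Math.ceil(fileSize / chunkSize) // number of leaf blocks
  let depth = 0
  while (nodes > 1) {
    nodes = Math.ceil(nodes / width)
    depth++
  }
  return depth
}

const MiB = 1024 ** 2
const GiB = 1024 ** 3

// 100 GiB file, 1 MiB chunks => 102400 leaves
console.log(dagDepth(100 * GiB, 1 * MiB, 1024)) // 2 levels above the leaves
console.log(dagDepth(100 * GiB, 1 * MiB, 174))  // 3 levels above the leaves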

Addendum: @rvagg has brought to my attention the following:

Mutation cost is very significant for thicker/wider DAGs, especially if you count the storage cost of historical data - if you update a single leaf at the bottom of a DAG that has a high branching factor, then each of those huge intermediate nodes need to be replaced entirely. This is why Filecoin tends to opt for more slender structures, because we have to care about state cost and state churn. The HAMT that Filecoin uses is only ever used with a bitwidth of 5 - a branching factor of 32. There’s another structure, the AMT, used as an array, and the branching factor varies from fairly high for structures that don’t change very much all the way down to 4! So they get tall and slender. Then when you mutate at the bottom, you’re mutating all the way up but you’re not involving CIDs of many other branches in that process.

Important to consider mutation in all of these, it doesn’t tend to be something you think about once you have a static graph, but there are plenty of places where it matters. Wikipedia has always been my favourite unixfs example, consider how it would go if you were to keep the wikipedia-on-ipfs project going and update it daily. I think if that were done, you’d want to get some stats on % of churn and then model the impact on the graph. I suspect a branching factor of 256 would definitely not be ideal for that, you’d essentially poison the “dedupe” potential of IPFS in the process because almost all of the intermediate graph section would need to mutate each day thanks to the random distribution of leaves in a HAMT. Pages may not change much, but the entire structure above the leaves probably would. I can’t recall how big it is, but it’s more than a few levels deep, 256 × 256 × 256 … heavy blobs being replaced each time.
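
A rough back-of-the-envelope of that mutation cost, assuming around 40 bytes per link (an assumed figure, just for scale):

// Updating one leaf rewrites one node per level; each rewritten node
// carries `width` links, so wider nodes mean more bytes re-hashed and re-stored.
function updateCost(entries: number, width: number, bytesPerLink = 40) {
  const depth = Math.ceil(Math.log(entries) / Math.log(width))
  return {
    nodesRewritten: depth,
    bytesRewritten: depth * width * bytesPerLink // rough upper bound
  }
}

console.log(updateCost(1_000_000, 256)) // { nodesRewritten: 3, bytesRewritten: 30720 }
console.log(updateCost(1_000_000, 32))  // { nodesRewritten: 4, bytesRewritten: 5120 }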

DAG width for IPFS use-cases

IPFS’ main value proposition is resilience and verifiability for data distribution. This is often at the cost of latency (or time to first byte). Retrieving data by CID is much more involved than normal HTTP retrieval: you have to go through content and peer routing before you even connect to someone providing the data.

All else being equal, wider DAGs allow for reduced latency and higher parallelism, which seems desirable if you accept the cost: wider DAGs undermine potential deduplication across versions, thereby increasing the storage cost of historical data. A good example of this is the IPFS Wikipedia mirror, where a full snapshot can be up to 650GB. A while back the topic of deduplication for Wikipedia mirrors was discussed, including systematically measuring and improving deduplication; however, it seems that most of that work was never concluded.

Wikipedia isn’t the typical IPFS use-case, and for such use-cases it’s totally fine to deviate from the defaults and use specialised chunking techniques and DAG layouts that are a better fit for deduplication.

IPFS today isn’t really optimised for, and doesn’t have much tooling for, versioning datasets – which would hypothetically benefit from narrower DAGs. And while storage keeps getting cheaper, networking is still constrained by geographical distance and the speed of light.

So let’s embrace wider DAGs as the default and make sure we communicate these trade-offs!

Other sources of CID divergence

Empty folders

  • Kubo’s ipfs add command and Helia’s unixfs.addAll (with globSource) add empty folders.
  • w3 and ipfs-car both ignore empty folders (they both depend on storacha/files-from-path which only returns files and ignores empty folders).

This means that if your input contains empty folders, the resulting CID will be different, even if all other settings are the same.

This brings up the question of why you would even include empty directories in the DAG.

My suggestion would be to account for this as we define the profiles, i.e. specify whether empty directories are included.

6 Likes

I have a quick update to share.

We’ve made progress, and the following implementations now yield the same CID for UnixFS given the same input (with the exception of empty folders, which we should probably make configurable).

Here you can see it in action:

# Add with Kubo 0.35
➜  helia-101 git:(main) ✗ ipfs --version
ipfs version 0.35.0
➜  helia-101 git:(main) ✗ ipfs config profile apply test-cid-v1-wide > /dev/null
➜  helia-101 git:(main) ✗ ipfs add -r test
added bafkreie4vttikpz5imd5ccfoanozyohpabvxlgovbjr5d4w7y3xep5m6ki test/index.spec.js
added bafybeifrc2vrh76j7dccg2hgihoy66su7jw2vvxoihrswevbdaazlquhpq test/war-and-peace.html
added bafybeiahi2rfez66oxcyeyrwniq7kktzeqf3bkffuyktbpwxlppw3k6b7i test
 3.85 MiB / 3.85 MiB [===================================================================================================================================] 100.00%

# Add with helia and @helia/unixfs
➜  helia-101 git:(main) ✗ node helia-add-unixfs.js test '**/*' helia.car
CID(bafkreie4vttikpz5imd5ccfoanozyohpabvxlgovbjr5d4w7y3xep5m6ki) /index.spec.js raw
CID(bafybeifrc2vrh76j7dccg2hgihoy66su7jw2vvxoihrswevbdaazlquhpq) /war-and-peace.html file
CID(bafybeiahi2rfez66oxcyeyrwniq7kktzeqf3bkffuyktbpwxlppw3k6b7i)  directory
Wrote car file to helia.car

# Add with ipfs-car v3
➜  helia-101 git:(main) ✗ ipfs-car -v
ipfs-car, 3.0.0
➜  helia-101 git:(main) ✗ ipfs-car pack test > ipfs-car.car
bafybeiahi2rfez66oxcyeyrwniq7kktzeqf3bkffuyktbpwxlppw3k6b7i
➜  helia-101 git:(main) ✗ ipfs-car ls ipfs-car.car --verbose
bafybeiahi2rfez66oxcyeyrwniq7kktzeqf3bkffuyktbpwxlppw3k6b7i	-	.
bafkreie4vttikpz5imd5ccfoanozyohpabvxlgovbjr5d4w7y3xep5m6ki	1373	./index.spec.js
bafybeifrc2vrh76j7dccg2hgihoy66su7jw2vvxoihrswevbdaazlquhpq	4039247	./war-and-peace.html
# Here you can see the blocks 
➜  helia-101 git:(main) ✗ ipfs-car blocks ipfs-car.car
bafkreie4vttikpz5imd5ccfoanozyohpabvxlgovbjr5d4w7y3xep5m6ki
bafkreib4xkrayqwopys7y4zv27baf2ga3yiswfitnfssruqafer6hqbktu
bafkreieu3cnito5c7ebd4ceri3dvhoy5xw73vvf4nut4v2jualhqlftb3q
bafkreibrvwq4goy4dmqviggvxvzty7wuqzwztggl7szkwht2kdek4ngmri
bafkreib433qkwzzkpxa2tt5cst5gb5scfu32t3tptt34mxa22i7ttdc5iy
bafybeifrc2vrh76j7dccg2hgihoy66su7jw2vvxoihrswevbdaazlquhpq
bafybeiahi2rfez66oxcyeyrwniq7kktzeqf3bkffuyktbpwxlppw3k6b7i

This already puts us in a much better place. Ongoing spec work can be found in the specs repo: IPIP 0499: CID Profiles by mishmosh · Pull Request #499 · ipfs/specs · GitHub

2 Likes

Just reading through the latest Kubo updates, and it looks like the addition of the DAG width option resolves the issues I noted in my post.

I’ll be doing some testing to confirm, but hopefully that’s the case!

2 Likes