Should we profile CIDs?

I mean, who is brave enough to use CIDv1s? Because those exact reasons apply to CIDv1s, and implementations are always going to have problems with new codecs etc.

Speaking in general, the v1 release allows us to set new defaults, and we should use the opportunity to do whatever we need to address user feedback now rather than later, so we may as well change the default encoding (to JSON even, because parsing protobufs and CBOR is the least friendly thing ever). Dag-pb itself is a big umbrella with file nodes, directory nodes, HAMT nodes etc., and CIDs give no hints. So, what info is worth embedding in a CID(v2), if we could choose? You could use codecs as profiles or look for other ways.

Specifically about the chunking params: if not in the CID, this info could also travel as metadata in the root dag-pb node, just like the size metadata does.

We could also do meta-cids consisting of an identity cid with metadata (profile) and a link to the actual root cid.

In terms of developer experience, understanding/implementing dag-pb itself in order to read a file is pretty hard. And CBOR is worse. Perhaps we need dumb, human-readable formats too.

2 Likes

Good question. Did we only care about raw blocks working on the ipfs.io public gateway, the latest Kubo, and IPFS Desktop? Or did we keep Kubo’s ipfs add at CIDv0 for a very, very long time partially to ensure everyone upgraded to a version that supports the raw codec in addition to dag-pb? Unsure what the answer is. But we know that today, the version distribution in the public swarm makes it relatively safe to switch to CIDv1 with raw leaves any time we want.

But if we add a new codec foo today, we will have at least six to twelve months where half of the public network is unable to read data created with it, so we should be extra sure the value added to the ecosystem is worth the pain.

If we leverage an extra Metadata node from existing UnixFS, or follow what was done in UnixFS 1.5 (extra optional fields in root blocks), we could stay backward-compatible (old clients will still be able to access files and dirs) and still solve many problems on the client side.

Off the top of my head:

  • parameters used for creating the dag (persisting “profile” info)
  • total length of raw user data (improving basic UX in places like directory listings, which currently show only the total size that includes dag-pb envelopes)
  • hash(es) of raw user data (e.g. md5 / sha2 without chunking) to improve interop with non-IPFS systems that already have hashes
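
To make this concrete, here is a minimal sketch of what such optional root-block fields could look like, written as a TypeScript interface purely for illustration (the field names are hypothetical, not part of any UnixFS spec):

// Illustrative only: optional metadata an importer could attach to a root
// block, which older clients would simply ignore.
interface RootBlockMetadata {
  profile?: string;          // parameters/"profile" used when building the DAG
  rawDataLength?: number;    // total bytes of user data, excluding dag-pb envelopes
  rawDataHashes?: {          // hashes of the unchunked raw data, for interop
    md5?: string;
    sha2_256?: string;
  };
}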
1 Like

The F-Droid folks ended up building ipfs-cid for this purpose, and got it into Debian so there was a straightforward way of getting a CID from a binary. I haven’t checked, but I’m guessing they used the Kubo defaults (or maybe I should say the Kubo-2022 profile?).

It’d be nice to include it as one basis for this, as its simplicity and availability in Debian derivatives make it an easy way to introduce these ideas.

2 Likes

The problem here is establishing hash equivalency, sometimes across different systems. CIDs get a lot of flak here because it seems like they promise portability, but they fall short. The reason they fall short is they don’t include all the possible configuration required to reproduce the resulting hash if you had the raw input data.

So here my question is do we attempt to do a CIDv2 that packs all the information into the CID as @hector suggests (probably we should), or should we also establish a mechanism for talking about hash equivalency?

Ultimately, I think we need a way to talk about hash equivalency. The underlying problem is that most people expect the same raw data to produce the same CID (I imagine by default they expect a SHA256 of all raw data taken as a whole) which is simply not the case and never will be. Encoding the UnixFS params in CID v2 makes more CIDs rather than less. We will always have many different CIDs for the same data.

My suggestion is to introduce a level of trust through signed attestations of CID equivalency.

A data structure might look like this:

{
   "original": "SHA256 of raw",
   "unixFS": [
      {
         "CID": "root~cid",
         "chunkingParams": {
           // ...
         }
      }
      // ... you could have more here
   ],
   "blake3": "blake3~cid",
   "pieceCID": "filecoin~piece~cid",
   "attested_by": "some~pub~key~ideally~the~original~data~author",
   "signature": "signature-bytes"
}

I’m just spitballing a structure here – we actually use UCANs in web3 storage for some of this but I’m not super a fan since they’re actually just attestations. But hopefully the above illustrates exactly how many CIDs we might actually want to tie together – there are a bunch.
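
To make that concrete, here’s a rough sketch (illustrative only, not something we actually run) of how a consumer could verify such an attestation, assuming attested_by is a PEM-encoded Ed25519 public key and that both sides agree on a naive JSON canonicalisation; a real design would want a deterministic encoding such as DAG-CBOR:

// Sketch only: verify an equivalence attestation with Node's built-in crypto.
import { createPublicKey, verify } from "node:crypto";

interface EquivalenceAttestation {
  original: string;                                   // SHA256 of the raw data
  unixFS: { CID: string; chunkingParams: Record<string, unknown> }[];
  blake3?: string;
  pieceCID?: string;
  attested_by: string;                                // attester's public key (PEM, for simplicity)
  signature: string;                                  // base64 signature over the payload
}

function verifyAttestation(a: EquivalenceAttestation): boolean {
  // Sign/verify everything except the signature field itself.
  const { signature, ...payload } = a;
  const bytes = Buffer.from(JSON.stringify(payload)); // naive canonicalisation (assumption)
  // For Ed25519 keys, Node's crypto.verify takes null as the algorithm.
  return verify(null, bytes, createPublicKey(a.attested_by), Buffer.from(signature, "base64"));
}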

Of course, now you’re trusting whomever created this attestation until you fetch the data. But ultimately, you’re always trusting before you fetch the data, caveat some incremental verifiability. And, depending on the data itself, there may be a higher level of trust in the person who signed it than in fetching from a random peer. Personally, if I have such an attestation for a Linux ISO signed by the pub key of the group that produces it, I’m inclined to relax my incremental-verifiability requirements at transport time (and still verify incrementally against, say, a UnixFS tree).

Moreover, once you fetch the data, you might produce an additional attestation you sign, so now you have a bunch of people saying “these two are the same” and at some point you establish a decent level of trust.

Anyway that’s my 2c :slight_smile:

2 Likes

Oh, nice. I’ve also just learned of python-libipld, a similarly minimal library to work with CIDs, CARs, and DAG-CBOR.

This has been a great discussion so far. Seems like there is strong (not unanimous) agreement on the current challenges, and general enthusiasm for hash equivalency and cleaner libraries/interfaces. I propose that we:

  1. Move profiles into the IPIP process - agree on a minimal set of profiles and names, plus a process for canonicalizing them. This may include working with Kubo maintainers to update the name of test-cid-v1. We could have this discussion here, and move to a PR against the specs repo when there’s a more concrete proposal.
  2. Consider more minimal implementations - Start a new thread about leaner IPFS implementations with key functions: hash, hash --profile, verify, [others] (see the sketch after this list).
  3. Host some sessions at IPFS Bangkok in Nov to work on these in person, in parallel with async/online.
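
For point 2, the kind of surface I have in mind might look like the sketch below (the names are hypothetical, not an existing CLI or library):

// Hypothetical interface for a lean CID tool; invented here for illustration.
interface LeanCidTool {
  // Compute the CID of a file using the tool's single built-in default profile.
  hash(path: string): Promise<string>;
  // Same, but with an explicit named profile (e.g. "test-cid-v1").
  hashWithProfile(path: string, profile: string): Promise<string>;
  // Re-derive the CID locally and compare it against an expected one.
  verify(path: string, expectedCid: string, profile?: string): Promise<boolean>;
}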
1 Like

I wouldn’t. IMO, this information should be file-level metadata (e.g., UnixFS metadata). This kind of metadata is very relevant to files/archiving, but much less relevant to, e.g., application/dapp data.

I also want to note that there are really two ways of using IPFS/IPLD.

  1. Import data from outside IPFS/IPLD. In this case, “profiles” make sense because you need to know how that data was imported.
  2. IPLD/IPFS Native: the files/data-structures were constructed as IPLD from the very beginning. In this case, there’s no obvious way to say, e.g., “this data was chunked with X” because the data may have evolved over time. Think MFS.

I want to be careful here because, ideally, we’d live more in world 2 than world 1 (obviously not the case today).

3 Likes

I find it challenging to understand the use case for having profile/algorithmic info in the CID or (even more of a stretch) metadata in a root node or a metadata node hanging off the root.

You have the original data, and you have a CID you want to match. But you don’t have info on how that CID was generated (otherwise you could replicate it by applying the same profile). You don’t want to fetch the DAG (because if you did, you could deduce whether it matches regardless of how it was chunked or what types of nodes were used, etc.). But you are OK with either large CIDs, fetching the root node, or fetching the root node and another node. And then your tool would come back with: yep, that’s the right CID, or no, I came up with this other CID.

Do I have this right?

I’d like to underline something brought up by @stebalien - “the data may have evolved over time.” In a project I use non-professionally to incrementally update an IPFS version of a directory tree, when I change how a node is arranged (usually replacing a CBOR subtree with a link in order to fit within my preferred block size), I don’t touch any part of the tree that’s not a direct ancestor of the block that needed to be changed.

What if one day someone did something similar, but was smart about it, so they used a chunker that favours the early bytes of a file for video but uses something more standard for text, and the node types used for directories shift based on the size of the directory and… do all of those thresholds and switches need to be encoded in the profile, and if so, is the profile now complicated enough that we don’t want it shoved into the CID? Perhaps if it’s a metadata node you could repeat the node in subtrees where the decision changes, but then the verifier still needs to fetch an arbitrarily large fraction of the DAG - why not get all of it? Are the tradeoffs really worth it?

It might make sense to think of CAR files as specific to the former set of use cases and alien to the latter, right? I’m not sure a CAR file has to be the referent of every CID, but it’s also a sensible default for many use cases (and as such worth refining/iterating/hardening). I love this framing, though, because we have to balance 1 against 2. If only 1 mattered, mandating maximally-verbose profiles at the “top level” or ingress/egress points of file-based systems would make perfect sense, while I am partial to not breaking any userspace in IPLD land. The tricky bit is how much flexibility to retain in terms of laying tracks in front of the IPLD-native train… it is possible to err too far on the side of the #2 use cases as well.

Maybe the trick here would be to have profiles clearly defined at time of CID generation/encoding/ingress, but not strictly coupled to CID[v2] OR to CAR files? Profiles are worth elaborating in more detail anyways, is my intuition.

In @alanshaw’s (of Storacha) 4-year-old library “ipfs-hash-only” (since archived and moved into the /ipfs-inactive/ GitHub org), there were many, many options exposed for how to generate CIDs -- compare this to the four properties tweaked by Kubo’s preconfigs! I remember discussing this with Alan in Istanbul: not everything this API allows one to manually tweak needs to be manually tweakable, but perhaps it’s worth breaking each profile out into a chunking strategy, a DAG-building strategy, etc.? (And of course, having “null” be an option for both!)
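
Roughly what I mean, as a sketch in TypeScript (the type and field names are invented here, not an existing API):

// Sketch of a profile decomposed into independent strategies, each of which
// can be null (meaning "don't chunk" / "don't layer the DAG").
type ChunkingStrategy =
  | { kind: "fixed-size"; sizeBytes: number }          // e.g. 256 KiB or 1 MiB
  | { kind: "rabin"; min: number; avg: number; max: number }
  | null;

type DagBuildingStrategy =
  | { kind: "balanced"; maxChildren: number }          // e.g. 174 or 1024
  | { kind: "trickle"; maxChildren: number }
  | null;

interface CidProfile {
  cidVersion: 0 | 1;
  hash: string;                                        // e.g. "sha2-256"
  rawLeaves: boolean;
  chunking: ChunkingStrategy;
  dagBuilding: DagBuildingStrategy;
}

// Kubo's test-cid-v1 preset (sha2-256, 1 MiB chunks, raw leaves, 174 max
// children) re-expressed under this decomposition:
const testCidV1: CidProfile = {
  cidVersion: 1,
  hash: "sha2-256",
  rawLeaves: true,
  chunking: { kind: "fixed-size", sizeBytes: 1024 * 1024 },
  dagBuilding: { kind: "balanced", maxChildren: 174 },
};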

Yes, but the convo is about how it is not possible to readily reproduce a CID from “a piece of data” without more information that is currently stored nowhere (and could be stored in the CID).

I agree. One problem is how the IPLD encoding (protobuf) and the file format (UnixFS) are coalesced into a single IPLD codec in this case (unlike CBOR etc.). So file-level metadata could go into the IPLD codec, if we understand codecs to mean file formats and not just IPLD data-blob encodings. Right?

1 Like

I would also add that one recurring theme in these discussions is that CIDs that point to a DAG of IPLD and CIDs that point to a logical unit of UnixFS files/directories are fairly distinct use cases, and mixing up the codecs leads to a lot of confusion and implicit assumptions. Perhaps it’s a good idea to keep a cleanup of multicodec in mind as one possible step in defining more robust profiles. For instance, registering a new codec for a UnixFS with a few invariants or config variables hardcoded, or a new codec for some more verbose/explicit UnixFS envelope/metadata file, whether that be CARvNext or something different altogether, would go a long way in making a profile for more foolproof UnixFS usage…

We’ve put up a simple profile at dasl.ing.

The goal is to align on as small a set of primitives as possible that are as simple as possible so that using content-addressing is a thing you spend maybe an hour figuring out once, and then just use without having to think about it. It’s kind of a gateway drug to IPFS.

This is new and subject to change — feedback very welcome!

If you want to join a friendly informal talk about it, there’s an upcoming event: lu.ma/jo7wbgqz

See the initial discussion on Bluesky.

2 Likes

FWIW, I updated the dag builder visualiser defaults to use the same parameters as test-cid-v1 used in Kubo.

Specifically: sha2-256, 1 MiB chunks, raw leaves, and 174 max children.

https://dag.ipfs.tech/

4 Likes

Here’s an update on DASL and other topics discussed here: DASL Update for Q1.

1 Like

Also, DASL is an incomplete solution to the problems described above because it doesn’t address chunking and DAG width for larger files.

So we are still pursuing profiling to get predictable, comparable CIDs.

Here is a summary table of current defaults, thanks to input & clarifications from @danieln @achingbrain @lidel:

              Helia default   Kubo default                  Storacha default   “test-cid-v1” profile   DASL
CID version   CIDv1           CIDv1                         CIDv1              CIDv1                   CIDv1
Hash Algo     sha-256         sha-256                       sha-256            sha-256                 sha-256
Chunk size    1MiB            256KiB                        1MiB               1MiB                    not specified
DAG width     1024            174 (but it’s complicated*)   1024               174                     not specified
DAG layout    balanced        balanced                      balanced           balanced                not specified
  * Kubo has 2 different default DAG widths:
    • For HAMT-sharded directories, the DefaultShardWidth here is 256.
    • For files, the DefaultLinksPerBlock here is ~174.

Kubo currently has no CLI / RPC / config option to control DAG width. ipfs add: parameter to control the width of created UnixFS DAG · Issue #10751 · ipfs/kubo · GitHub is the starting point for adding that ability.

Next steps:

  1. Discuss whether 1024 or 174 width is preferred, or if it’s worth having both.
  2. Come up with a better naming system for these profiles (test-cid-v1 isn’t quite right: it covers a lot more than the CID version, the “test” part doesn’t instill confidence, and it doesn’t work well in a series because it’s unclear whether “v1” refers to the CID version or a profile version).
  3. Identify someone to land ipfs add: parameter to control the width of created UnixFS DAG · Issue #10751 · ipfs/kubo · GitHub
3 Likes

Adding some additional context as I’ve been diving into this.

The state of UnixFS in JavaScript/TypeScript

Most developers use higher-level libraries that depend on these lower-level UnixFS libraries, and those have slightly different defaults.

Defaults and naming profiles

Naming things is hard. Moreover, understanding the trade-offs in different UnixFS options is far from obvious to newcomers sold on content addressing.

In thinking about this for users not familiar with internals, the conclusion I came to is that we should lean more heavily on defaults to guide users to the happy path, which ensures CID equivalency given the same inputs.

As for naming, my initial suggestion was to name the profile unixfs-v1-2025, denoting the year it was ratified. This is grounded in the insight that consensus around conventions can change over time, though not that often. However, I realise the shortcomings of this approach: it carries no information about the specifics of the profile, so the actual parameters will likely need to live in the spec. Finally, with time, it might feel “outdated”.

I should also note that I don’t think CIDv2 packing this information is pragmatic. This will be a breaking change that I don’t think the ecosystem will embrace, leading to more fragmentation and confusion.

Another approach could be to name profiles based on the key UnixFS/CID parameters:

  • CID version
  • hash function
  • layout, e.g. balanced, trickle
  • chunk-size
  • dag width
  • raw blocks
  • HAMT threshold (I’d need to dive deeper into whether there’s that much variance around this)

For example v1-sha256-balanced-1mib-1024w-raw.

Long and convoluted, but encapsulates the information.
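
As a sketch, such a name could be generated mechanically from the parameters; the abbreviations and field order below are just my assumption, extrapolated from the example above:

// Sketch of encoding the proposed name scheme from its parameters.
interface ProfileParams {
  cidVersion: 0 | 1;
  hash: string;                    // e.g. "sha256"
  layout: "balanced" | "trickle";
  chunkSize: string;               // e.g. "1mib", "256kib"
  dagWidth: number;                // e.g. 174 or 1024
  rawLeaves: boolean;
}

function encodeProfileName(p: ProfileParams): string {
  return [
    `v${p.cidVersion}`,
    p.hash,
    p.layout,
    p.chunkSize,
    `${p.dagWidth}w`,
    p.rawLeaves ? "raw" : "pb-leaves",
  ].join("-");
}

// encodeProfileName({ cidVersion: 1, hash: "sha256", layout: "balanced",
//   chunkSize: "1mib", dagWidth: 1024, rawLeaves: true })
// -> "v1-sha256-balanced-1mib-1024w-raw"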

HAMT and autosharding

HAMTs are used to shard UnixFS directory blocks that contain so many links that the block would exceed a certain size.

Almost all implementations use a HAMT fanout of 256. This refers to the number of “sub-shards”, or the ShardWidth.

Implementations vary in how they determine whether to use a HAMT. Some support autosharding, where they automatically shard based on an estimate of the block size (counting the size of PBNode.Links).

  • Kubo/Boxo uses a size-based parameter (HAMTShardingSize) of 256KiB, where the block size is estimated from the size of all links/names. An estimate is used (rather than the actual block size) to avoid needing to serialise the Protobuf just to measure it (see the sketch after this list).
    • go-unixfsnode (used by go-car and extensively by the Filecoin ecosystem) also autoshards like Boxo/Kubo.
  • Helia and ipfs/js-ipfs-unixfs use the same approach as Kubo (discussion, and this comment). The config option is shardSplitThresholdBytes, which defaults to 256KiB.
  • ipld/js-unixfs, which the Storacha tools ipfs-car and w3up depend on, doesn’t implement autosharding (open issue). Consumers of the library like ipfs-car and w3up trigger HAMT sharding once a directory has 1000 links.
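
For illustration, the autosharding idea boils down to something like the sketch below (this is not Boxo’s exact code; the estimation details are simplified assumptions):

// Sketch: estimate a directory block's size from its links and switch to a
// HAMT once the estimate crosses a threshold (256 KiB mirrors HAMTShardingSize).
import { CID } from "multiformats/cid";

const HAMT_SHARDING_SIZE = 256 * 1024;

interface DirLink {
  name: string;
  cid: CID;
}

// Cheap estimate: link names plus CID bytes, skipping protobuf serialisation.
// (String length is used as a rough proxy for name bytes.)
function estimateDirBlockSize(links: DirLink[]): number {
  return links.reduce((sum, l) => sum + l.name.length + l.cid.bytes.length, 0);
}

function shouldUseHamt(links: DirLink[]): boolean {
  return estimateDirBlockSize(links) > HAMT_SHARDING_SIZE;
}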

Other sources of CID divergence

Empty folders

  • Kubo’s ipfs add command and Helia’s unixfs.addAll (with globSource) add empty folders.
  • w3 and ipfs-car both ignore empty folders (they both depend on storacha/files-from-path which only returns files and ignores empty folders).

This means that if your input contains empty folders, the resulting CID will be different, even if all other settings are the same.

This brings up the question of why you would even include empty directories in the DAG.

My suggestion would be to account for this as we define the profile settings, by specifying whether empty directories are included.

1 Like