Should we profile CIDs?

I mean, who is brave enough to use CIDv1s? Those exact reasons apply to CIDv1s too, and implementations are always going to have problems with new codecs etc.

Speaking in general, the v1 release allows us to set new defaults, and we should use the opportunity to address user feedback now rather than later, so we may as well change the default encoding (even to JSON, because wtf, parsing protobufs and CBOR is the least friendly thing ever). Dag-pb itself is a big umbrella covering file nodes, directory nodes, HAMT nodes etc., and CIDs give no hints. So, what information is worth embedding in a CID(v2), if we could choose? You could use codecs as profiles or look for other ways.

Specifically about the chunking params, if not in the CID, this info could also travel as metadata in the root dag-pb node just like there is size metadata.

We could also do meta-cids consisting of an identity cid with metadata (profile) and a link to the actual root cid.
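To make that concrete, here's a hedged sketch (Python; the field names "root" and "profile" are invented, and a real design would presumably use dag-cbor/dag-json rather than plain JSON) of what an identity-hashed meta-CID payload could carry:

```python
import json

def make_meta_payload(root_cid: str, profile: dict) -> bytes:
    # An identity-multihash CID carries its payload inline, so resolving
    # this "meta-CID" needs no network fetch: the profile travels with it.
    wrapper = {
        "root": root_cid,    # link to the actual root CID
        "profile": profile,  # e.g. chunker, DAG layout, hash function
    }
    # Deterministic-ish encoding; a real design would canonicalize properly.
    return json.dumps(wrapper, sort_keys=True, separators=(",", ":")).encode()

payload = make_meta_payload(
    "bafy...rootcid",  # placeholder, not a real CID
    {"chunker": "size-262144", "layout": "balanced", "hash": "sha2-256"},
)
```

With the identity multihash, the digest is the payload bytes themselves, so clients could read the profile directly out of the meta-CID and then follow the "root" link as usual.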

In terms of developer experience, understanding/implementing dag-pb itself in order to read a file is pretty hard. And cbor is worse. Perhaps we need dumb, human-readable formats too.


Good question. Did we only care about raw blocks working on the ipfs.io public gateway + latest Kubo + IPFS Desktop? Or did we keep Kubo’s ipfs add at CIDv0 for a very, very long time partially to ensure everyone upgraded to a version that supports the raw codec in addition to dag-pb? I’m unsure what the answer is. But we know that today, the version distribution in the public swarm makes it relatively safe to switch to CIDv1 with raw leaves any time we want.

But if we add a new codec foo today, we will have at least six to 12 months where half of the public network is unable to read data created with it, so we should be extra sure the value added to the ecosystem is worth the pain.

If we leverage the extra Metadata node from existing UnixFS, or follow what was done in UnixFS 1.5 (extra optional fields in root blocks), we could be backward-compatible (old clients will still be able to access files and dirs) and still solve many problems on the client side.

Off the top of my head:

  • parameters used for creating the dag (persisting “profile” info)
  • total length of raw user data (improving basic UX in places like directory listings, which currently show only the total size that includes dag-pb envelopes)
  • hash(es) of raw user data (e.g. md5 / sha2 without chunking) to improve interop with non-IPFS systems that already have hashes
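On the second and third bullets, a minimal sketch (Python; the dict keys are invented, not actual UnixFS 1.5 fields) of computing whole-stream digests and the raw length at import time, so they could be persisted as optional root-block metadata:

```python
import hashlib

def whole_file_digests(data: bytes) -> dict:
    # Digests of the raw bytes taken as one stream -- what non-IPFS systems
    # (package mirrors, object stores) typically already publish.
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "sha2-256": hashlib.sha256(data).hexdigest(),
        "length": len(data),  # raw user-data length, excluding dag-pb envelopes
    }

meta = whole_file_digests(b"hello world")
```

Storing something like this alongside the root would let a directory listing show real file sizes and let external systems match their existing sha256/md5 without re-chunking anything.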

The FDroid folks ended up building ipfs-cid for this purpose, and got it into Debian so there was a straightforward way of getting a CID from a binary. I haven’t checked but I’m guessing they used the Kubo defaults (or maybe I should say the Kubo-2022 profile?).

It’d be nice to include it as one basis for this; its simplicity and presence in Debian derivatives make it an easy way to introduce these ideas.


The problem here is establishing hash equivalency, sometimes across different systems. CIDs get a lot of flak here because it seems like they promise portability, but they fall short. The reason they fall short is they don’t include all the possible configuration required to reproduce the resulting hash if you had the raw input data.

So here my question is: do we attempt a CIDv2 that packs all the information into the CID, as @hector suggests (probably we should), or should we also establish a mechanism for talking about hash equivalency?

Ultimately, I think we need a way to talk about hash equivalency. The underlying problem is that most people expect the same raw data to produce the same CID (I imagine by default they expect a SHA256 of all raw data taken as a whole) which is simply not the case and never will be. Encoding the UnixFS params in CID v2 makes more CIDs rather than less. We will always have many different CIDs for the same data.

My suggestion is to introduce a level of trust through signed attestations of CID equivalency.

A data structure might look like this:

{
   "original": "SHA256 of raw",
   "unixFS": [
      {
         "CID": "root~cid",
         "chunkingParams": {
           // ....
         }
      },
      // ... you could have more here
   ],
   "blake3": "blake3~cid",
   "pieceCID": "filecoin~piece~cid",
   "attested_by": "some~pub~key~ideally~the~original~data~author",
   "signature": "signature-bytes"
}

I’m just spitballing a structure here – we actually use UCANs in web3 storage for some of this but I’m not super a fan since they’re actually just attestations. But hopefully the above illustrates exactly how many CIDs we might actually want to tie together – there are a bunch.
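To sketch how signing and checking such an attestation could work (Python stdlib only; a real design would use a public-key scheme such as ed25519, so the HMAC here is just a stand-in for a detached signature over a canonical encoding):

```python
import hashlib
import hmac
import json

def _canonical(att: dict) -> bytes:
    # Sign everything except the signature field itself, in a stable order.
    unsigned = {k: v for k, v in att.items() if k != "signature"}
    return json.dumps(unsigned, sort_keys=True, separators=(",", ":")).encode()

def sign_attestation(att: dict, key: bytes) -> dict:
    # Stand-in for a real public-key signature (e.g. ed25519).
    att["signature"] = hmac.new(key, _canonical(att), hashlib.sha256).hexdigest()
    return att

def verify_attestation(att: dict, key: bytes) -> bool:
    expected = hmac.new(key, _canonical(att), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, att.get("signature", ""))

key = b"attester-secret"  # placeholder; a real attester holds a signing key
att = sign_attestation(
    {"original": "sha256-of-raw", "unixFS": [{"CID": "root-cid"}]}, key
)
```

The point is only that the equivalency claim, not the data, is what gets signed; anyone holding the attester's verification key can check the claim before deciding whether to fetch.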

Of course now you’re trusting whoever created this attestation until you fetch the data. But ultimately, you’re always trusting something before you fetch the data, caveat some incremental verifiability. And, depending on the data itself, there may be a higher level of trust in the person who signed it than in fetching from a random peer. Personally, if I have such an attestation for a Linux ISO signed by the pub key of the group that produces it, I’m inclined to relax my incremental-verifiability requirements at transport time (and still verify incrementally against, say, a UnixFS tree).

Moreover, once you fetch the data, you might produce an additional attestation you sign, so now you have a bunch of people saying “these two are the same” and at some point you establish a decent level of trust.

Anyway that’s my 2c :slight_smile:


Oh, nice. I’ve also just learned of python-libipld, a similarly minimal library to work with CIDs, CARs, and DAG-CBOR.

This has been a great discussion so far. Seems like there is strong (not unanimous) agreement on the current challenges, and general enthusiasm for hash equivalency and cleaner libraries/interfaces. I propose that we:

  1. Move profiles into IPIP process - Agree on a minimal set of profiles & names, plus process for canonicalizing. This may include working with kubo to update the name of test-cid-v1. We could have this discussion here, and move to PR to specs repo when there’s a more concrete proposal.
  2. Consider more minimal implementations - Start a new thread about leaner IPFS implementations with key functions: hash, hash --profile, verify, [others].
  3. Host some sessions at IPFS Bangkok in Nov to work on these in person, in parallel with async/online.
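For point 2, a sketch of the surface such a lean tool might expose (Python; the subcommand names come from the list above, but the tool name, flags, and defaults are all assumptions):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Minimal surface: hash a file (optionally under a named profile)
    # and verify a file against an expected CID.
    p = argparse.ArgumentParser(prog="ipfs-hash")  # hypothetical tool name
    sub = p.add_subparsers(dest="cmd", required=True)

    h = sub.add_parser("hash", help="compute the CID of a file")
    h.add_argument("path")
    h.add_argument("--profile", default="kubo-2022")  # illustrative default

    v = sub.add_parser("verify", help="check a file against an expected CID")
    v.add_argument("path")
    v.add_argument("expected_cid")
    v.add_argument("--profile", default="kubo-2022")
    return p

args = build_parser().parse_args(["hash", "file.bin", "--profile", "test-cid-v1"])
```

Keeping the profile a single named argument (rather than a pile of chunking/layout flags) is what would make the canonicalization work in point 1 pay off.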

I wouldn’t. IMO, this information should be file-level metadata (e.g., UnixFS metadata). This kind of metadata is very relevant to files/archiving, but much less relevant to, e.g., application/dapp data.

I also want to note that there are really two ways of using IPFS/IPLD.

  1. Import data from outside IPFS/IPLD. In this case, “profiles” make sense because you need to know how that data was imported.
  2. IPLD/IPFS Native: the files/data-structures were constructed as IPLD from the very beginning. In this case, there’s no obvious way to say, e.g., “this data was chunked with X” because the data may have evolved over time. Think MFS.

I want to be careful here because, ideally, we’d live more in world 2 than world 1 (obviously not the case today).


I find it challenging to understand the use case for having profile/algorithmic info in the CID or (even more of a stretch) metadata in a root node or a metadata node hanging off the root.

You have the original data, and you have a CID you want to match. But you don’t have info on how that CID was generated (otherwise you could replicate it by applying the same profile). You don’t want to fetch the DAG (because if you did, you could deduce whether it matches regardless of how it was chunked or what types of nodes were used, etc.). But you are OK with either large CIDs, fetching the root node, or fetching the root node plus another node. And then your tool would come back with either “yep, that’s the right CID” or “no, I came up with this other CID.”

Do I have this right?
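Assuming that reading is right, the verify step itself is mechanically simple once the profile is known. A toy sketch (Python; fixed-size chunking and a flat hash-of-leaf-hashes root stand in for real dag-pb/UnixFS tree building and CID encoding):

```python
import hashlib

def toy_root_digest(data: bytes, chunk_size: int) -> str:
    # Hash each fixed-size chunk, then hash the concatenated leaf digests.
    # A real tool would build the actual dag-pb/UnixFS DAG and emit a CID.
    leaves = [
        hashlib.sha256(data[i : i + chunk_size]).digest()
        for i in range(0, len(data), chunk_size)
    ]
    return hashlib.sha256(b"".join(leaves)).hexdigest()

def matches(data: bytes, profile: dict, expected: str) -> bool:
    # "Yep, that's the right CID" or "no, I came up with this other one."
    return toy_root_digest(data, profile["chunk_size"]) == expected

data = b"x" * 1000
expected = toy_root_digest(data, 256)
```

The sketch also shows the objection in miniature: the same bytes under a different chunk size yield a different root, which is exactly why the profile has to travel somewhere.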

I’d like to underline something brought up by @stebalien: “the data may have evolved over time.” In a project I use non-professionally to incrementally update an IPFS version of a directory tree, when I change how a node is arranged (usually replacing a CBOR subtree with a link in order to fit within my preferred block size), I don’t touch any part of the tree that’s not a direct ancestor of the block that needed to change.

What if one day someone did something similar, but was smart about it: they used a chunker that prefers the early bytes of a file for video, but something more standard for text, and the node types used for directories shift based on the size of the directory, and… do all of those thresholds and switches need to be encoded in the profile? And if so, is the profile now complicated enough that we don’t want it shoved into the CID? Perhaps if it’s a metadata node you could repeat the node in subtrees where the decision changes, but then the verifier still needs to fetch an arbitrarily large fraction of the DAG - why not get all of it? Are the tradeoffs really worth it?

Might it make sense to think of CAR files as specific to the former set of use cases and alien to the latter? I’m not sure a CAR file has to be the referent of every CID, but it’s a sensible default for many use cases (and as such worth refining/iterating/hardening). I love this framing, though, because we have to balance 1 against 2. If only 1 mattered, it would make perfect sense to mandate maximally-verbose profiles at the “top level” or ingress/egress points of file-based systems as universal, while I am partial to not breaking any userspace in IPLD land. The tricky bit is how much flexibility to retain in terms of laying tracks in front of the IPLD-native train; it is possible to err too far on the side of the #2 use cases as well.

Maybe the trick here would be to have profiles clearly defined at time of CID generation/encoding/ingress, but not strictly coupled to CID[v2] OR to CAR files? Profiles are worth elaborating in more detail anyways, is my intuition.

In @alanshaw (of Storacha’s) 4-year-old, archived-and-moved-into-/ipfs-inactive/-github-org library “ipfs-hash-only”, there were many, many options exposed for how to generate CIDs - compare this to the four properties tweaked by kubo’s preconfigs! I remember discussing this with Alan in Istanbul; not everything this API allows one to manually tweak needs to be manually tweakable, but perhaps it’s worth breaking each profile out into chunking strategy, DAG-building strategy, etc.? (And of course, having “null” be an option for both!)
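Sketched as a data structure (Python; the axis names and the preset values are illustrative, not a spec), that decomposition might look like:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Profile:
    # Each axis is independently tweakable; None means "unspecified",
    # i.e. the profile makes no promise about that axis.
    chunking: Optional[str] = None    # e.g. "size-262144", "rabin", "buzhash"
    dag_layout: Optional[str] = None  # e.g. "balanced", "trickle"
    leaf_codec: Optional[str] = None  # e.g. "raw", "dag-pb"
    hash_fn: Optional[str] = None     # e.g. "sha2-256", "blake3"

# A named preset comparable to kubo's preconfigs (values illustrative only):
KUBO_2022 = Profile("size-262144", "balanced", "raw", "sha2-256")
```

A small, closed set of named presets built from these axes would cover the ipfs-cid/Debian case, while the None options leave room for the IPLD-native world where a single answer doesn’t exist.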

Yes, but the convo is about how it is not possible to readily reproduce a CID from “a piece of data” without more information that is currently stored nowhere (and could be stored in the CID).

I agree. One problem is how the ipld encoding (protobuf) and the file format (unixfs) are coalesced into a single ipld codec in this case (unlike cbor etc.). So file-level metadata could go into the ipld codec if we understand codecs to mean file formats and not just ipld data blob encoding. Right?


I would also add that one recurring theme in these discussions is that CIDs that point to a DAG of IPLD and CIDs that point to a logical unit of UnixFS files/directories are fairly distinct use cases, and mixing up the codecs leads to a lot of confusion and implicit assumptions. Perhaps it’s a good idea to keep a cleanup of multicodec in mind as one possible step in defining more robust profiles. For instance, registering a new codec for a UnixFS with a few invariants or config variables hardcoded, or a new codec for some more verbose/explicit UnixFS envelope/metadata file, whether that be CARvNext or something different altogether, would go a long way in making a profile for a more foolproof UnixFS usage…