The problem here is establishing hash equivalency, sometimes across different systems. CIDs get a lot of flak here because they seem to promise portability but fall short. They fall short because they don’t include all the configuration needed to reproduce the resulting hash from the raw input data.
So my question is: do we attempt a CIDv2 that packs all the information into the CID, as @hector suggests (probably we should), or should we also establish a mechanism for talking about hash equivalency?
Ultimately, I think we need a way to talk about hash equivalency. The underlying problem is that most people expect the same raw data to produce the same CID (I imagine by default they expect a SHA256 of all the raw data taken as a whole), which is simply not the case and never will be. Encoding the UnixFS params in CIDv2 makes more CIDs rather than fewer. We will always have many different CIDs for the same data.
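To make that concrete, here is a toy sketch of why the same bytes end up with different identifiers. It is not UnixFS and produces no real CIDs: `toy_merkle_root` is a hypothetical stand-in that hashes fixed-size chunks and then hashes the concatenated chunk digests, which is enough to show that chunking parameters alone change the root.

```python
import hashlib


def flat_sha256(data: bytes) -> str:
    """What most people expect: one SHA-256 over all the raw bytes."""
    return hashlib.sha256(data).hexdigest()


def toy_merkle_root(data: bytes, chunk_size: int) -> str:
    """Toy stand-in for a chunked DAG (NOT real UnixFS).

    Hash each fixed-size chunk, then hash the concatenated chunk digests.
    """
    leaves = [
        hashlib.sha256(data[i:i + chunk_size]).digest()
        for i in range(0, len(data), chunk_size)
    ]
    return hashlib.sha256(b"".join(leaves)).hexdigest()


data = bytes(range(256)) * 4096  # 1 MiB of sample data

digests = {
    "flat sha256": flat_sha256(data),
    "chunked, 256 KiB": toy_merkle_root(data, 256 * 1024),
    "chunked, 1 MiB": toy_merkle_root(data, 1024 * 1024),
}

for label, digest in digests.items():
    print(f"{label:>18}: {digest}")
```

Same input, three different digests, and that is before you vary the hash function, DAG layout, or codec.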
My suggestion is to introduce a level of trust through signed attestations of CID equivalency.
A data structure might look like this:
{
  "original": "SHA256 of raw",
  "unixFS": [
    {
      "CID": "root~cid",
      "chunkingParams": {
        // ....
      }
    },
    // ... you could have more here
  ],
  "blake3": "blake3~cid",
  "pieceCID": "filecoin~piece~cid",
  "attested_by": "some~pub~key~ideally~the~original~data~author",
  "signature": "signature-bytes"
}
I’m just spitballing a structure here. We actually use UCANs in web3 storage for some of this, but I’m not a huge fan, since what we’re producing there is really just attestations. But hopefully the above illustrates how many CIDs we might actually want to tie together: there are a bunch.
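One detail any structure like this needs is a stable byte string to sign. A minimal sketch, assuming the signature covers every field except `signature` itself and that sorted-key compact JSON is an acceptable canonical form (both are my assumptions; `signing_payload` is a hypothetical helper, and the actual signing scheme is left out):

```python
import hashlib
import json


def signing_payload(attestation: dict) -> bytes:
    """Deterministic bytes to sign: everything except the signature field.

    Assumption: sorted keys + compact separators is the canonical form.
    """
    unsigned = {k: v for k, v in attestation.items() if k != "signature"}
    return json.dumps(unsigned, sort_keys=True, separators=(",", ":")).encode("utf-8")


attestation = {
    "original": "SHA256 of raw",
    "unixFS": [{"CID": "root~cid", "chunkingParams": {}}],
    "blake3": "blake3~cid",
    "pieceCID": "filecoin~piece~cid",
    "attested_by": "some~pub~key",
    "signature": "signature-bytes",
}

payload = signing_payload(attestation)
# A signer would sign `payload` (or this digest) with the key named in "attested_by".
print(hashlib.sha256(payload).hexdigest())
```

Key order and the signature value don't affect the payload, so two parties serializing the same attestation get the same bytes to verify against.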
Of course, now you’re trusting whoever created this attestation until you fetch the data. But ultimately, you’re always trusting someone before you fetch the data, modulo some incremental verifiability. And, depending on the data itself, there may be a higher level of trust in the person who signed the attestation than in a random peer you fetch from. Personally, if I have such an attestation for a Linux ISO, signed by the pub key of the group that produces it, I’m inclined to relax my incremental verifiability requirements at transport time (and maybe still verify incrementally against a UnixFS tree).
Moreover, once you fetch the data, you might produce an additional attestation you sign, so now you have a bunch of people saying “these two are the same” and at some point you establish a decent level of trust.
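As a rough sketch of that last idea: count how many distinct signers say two CIDs refer to the same data, and accept the equivalence past some threshold. The field names follow the structure above; the function names and the threshold of 2 are hypothetical, and the code assumes signatures were already verified elsewhere.

```python
def linked_cids(att: dict) -> set:
    """All identifiers a single attestation ties together."""
    cids = {att[key] for key in ("original", "blake3", "pieceCID") if key in att}
    cids.update(entry["CID"] for entry in att.get("unixFS", []))
    return cids


def attesters_for(attestations: list, cid_a: str, cid_b: str) -> set:
    """Distinct signers whose attestation contains both CIDs.

    Assumption: every attestation here has already passed signature checks.
    """
    return {
        att["attested_by"]
        for att in attestations
        if {cid_a, cid_b} <= linked_cids(att)
    }


def equivalence_trusted(attestations: list, cid_a: str, cid_b: str, threshold: int = 2) -> bool:
    return len(attesters_for(attestations, cid_a, cid_b)) >= threshold


attestations = [
    {"original": "sha~raw", "unixFS": [{"CID": "root~1"}], "blake3": "b3~1", "attested_by": "alice"},
    {"original": "sha~raw", "unixFS": [{"CID": "root~1"}], "attested_by": "bob"},
    {"original": "sha~raw", "unixFS": [{"CID": "root~2"}], "attested_by": "carol"},
]

print(equivalence_trusted(attestations, "sha~raw", "root~1"))  # two distinct signers
print(equivalence_trusted(attestations, "sha~raw", "root~2"))  # only one signer
```

A real policy would probably weight signers (the original author counts for more than a random peer) rather than just counting heads.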
Anyway, that’s my 2c.