I find it challenging to understand the use case for having profile/algorithmic info in the CID or (even more of a stretch) metadata in a root node or a metadata node hanging off the root.
You have the original data, and you have a CID you want to match. But you don’t have info on how that CID was generated (otherwise you could replicate it by applying the same profile). You don’t want to fetch the DAG (because if you did, you could check whether it matches regardless of how it was chunked or what node types were used, etc.). But you are OK with one of: large CIDs; fetching the root node; or fetching the root node and another node. And then your tool would come back with: yep, that’s the right CID, or no, I came up with this other CID.
Do I have this right?
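If so, the tool boils down to something like the sketch below. Everything in it is a stand-in - `Profile` is whatever bundle of choices (chunker, DAG layout, node codec, hash, CID version) the proposal would record, not an existing API:

```go
// A sketch of the verification workflow as I understand it.
// "Profile" is hypothetical: whatever bundle of choices the
// proposal would record alongside (or inside) the CID.
package verify

// Profile captures the choices needed to reproduce a CID from raw data.
type Profile interface {
	// BuildCID re-chunks the data, rebuilds the DAG, and returns the
	// root CID this profile would have produced.
	BuildCID(data []byte) (string, error)
}

// Verify re-derives the CID from the original data using the given
// profile and compares it against the CID we were asked to match.
func Verify(data []byte, want string, p Profile) (ok bool, got string, err error) {
	got, err = p.BuildCID(data)
	if err != nil {
		return false, "", err
	}
	// "yep, that's the right CID" or "no, I came up with this other CID"
	return got == want, got, nil
}
```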
I’d like to underline something brought up by @stebalien - “the data may have evolved over time.” In a project I use non-professionally to incrementally update an IPFS version of a directory tree, when I change how a node is arranged (usually replacing a CBOR subtree with a link in order to fit within my preferred block size), I don’t touch any part of the tree that isn’t a direct ancestor of the block that needed to change.
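In spirit (not the project’s actual code - types and helpers are made up for illustration), that step looks roughly like this; the point is that only the changed node and its direct ancestors get re-encoded, while untouched subtrees keep their existing CIDs:

```go
// Simplified sketch of "replace a CBOR subtree with a link to fit the
// preferred block size", with made-up types.
package incremental

const preferredBlockSize = 256 * 1024 // illustrative threshold

// node is a simplified dag-cbor-ish node: some children still embedded
// as CBOR subtrees, some already hoisted out behind a CID link.
// Assume both maps are initialized.
type node struct {
	inline map[string]*node  // embedded subtrees
	links  map[string]string // children replaced by CID links
}

// Stand-ins, assumed to exist elsewhere: "encode to CBOR and measure"
// and "encode the block, write it, return its CID".
var encodedSize func(n *node) int
var store func(n *node) (cid string)

// rearrange hoists inline subtrees out as their own linked blocks until
// this node fits the preferred block size, then stores it and returns
// its CID. Only nodes on the path from the change to the root call this.
func rearrange(n *node) string {
	for name, child := range n.inline {
		if encodedSize(n) <= preferredBlockSize {
			break
		}
		n.links[name] = rearrange(child) // subtree becomes its own block
		delete(n.inline, name)
	}
	return store(n)
}
```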
What if one day someone did something similar, but was smart about it: they used a chunker that favors the early bytes of a file for video, used something more standard for text, shifted the node types used for directories based on directory size, and… Do all of those thresholds and switches need to be encoded in the profile, and if so, is the profile now complicated enough that we don’t want it shoved into the CID? Perhaps if it’s a metadata node you could repeat the node in subtrees where the decision changes, but then the verifier still needs to fetch an arbitrarily large fraction of the DAG - why not just get all of it? Are the tradeoffs really worth it?
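To make that concrete, here is the kind of descriptor such a content-aware profile would have to spell out for a verifier to reproduce the exact DAG. This is entirely made up - none of these fields exist in any spec - the point is just how much ends up in it:

```go
// Hypothetical shape of a content-aware build profile.
package profile

type ChunkerRule struct {
	MediaType string // e.g. "video/*", "text/*"
	Chunker   string // e.g. "front-weighted", "fixed-256KiB"
}

type DirectoryRule struct {
	MaxEntries int    // switch layout once a directory grows past this
	NodeKind   string // e.g. "flat", "HAMT"
}

type Profile struct {
	CIDVersion     int
	HashFunc       string
	NodeCodec      string          // e.g. "dag-pb", "dag-cbor"
	ChunkerRules   []ChunkerRule   // per-media-type chunking, in priority order
	DirectoryRules []DirectoryRule // size-dependent directory layout
	MaxBlockSize   int
	// ...plus every other threshold or switch the builder ever consulted.
}
```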