Adding some additional context as I’ve been diving into this.
The state of UnixFS in JavaScript/TypeScript
- In JS land, we have two UnixFS implementations: ipfs/js-ipfs-unixfs (used by Helia) and ipld/js-unixfs (used by the Storacha tools)
Most developers use higher-level libraries that depend on these, which have slightly different defaults.
Defaults and naming profiles
Naming things is hard. Moreover, understanding the trade-offs in different UnixFS options is far from obvious to newcomers sold on content addressing.
Thinking about this for users not familiar with the internals, the conclusion I came to is that we should lean more heavily on defaults to guide users to the happy path, which ensures CID equivalence given the same inputs.
As for naming, my initial suggestion was to name the profile unixfs-v1-2025, denoting the year it was ratified. This is grounded in the insight that consensus around conventions can change over time, though not that often. However, I realise the shortcomings of this approach: it carries no information about the specifics of the profile, so the actual parameters will likely need to live in the spec. Finally, with time, this might feel “outdated”.
I should also note that I don’t think a CIDv2 packing this information is pragmatic. It would be a breaking change that I don’t think the ecosystem will embrace, leading to more fragmentation and confusion.
Another approach could be to name profiles based on the key UnixFS/CID parameters:
- CID version
- hash function
- layout, e.g. balanced, trickle
- chunk-size
- dag width
- raw blocks
- HAMT threshold (I’d need to dive deeper into whether there’s that much variance around this)
For example: v1-sha256-balanced-1mib-1024w-raw. Long and convoluted, but it encapsulates the information.
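To make the idea concrete, here’s a sketch of how such a parameter-encoding name could be derived from a profile descriptor. Every name here (the interface, its fields, the function) is hypothetical, not a proposed API:

```typescript
// Hypothetical profile descriptor; field names are illustrative only.
interface UnixFSProfile {
  cidVersion: 1;
  hash: string;                     // e.g. "sha256"
  layout: "balanced" | "trickle";
  chunkSize: string;                // e.g. "1mib"
  dagWidth: number;                 // max links per intermediate node
  rawLeaves: boolean;               // raw blocks for leaves vs dag-pb-wrapped
}

// Render the convoluted-but-explicit name suggested above.
function profileName(p: UnixFSProfile): string {
  return [
    `v${p.cidVersion}`,
    p.hash,
    p.layout,
    p.chunkSize,
    `${p.dagWidth}w`,
    p.rawLeaves ? "raw" : "dagpb",
  ].join("-");
}

console.log(profileName({
  cidVersion: 1,
  hash: "sha256",
  layout: "balanced",
  chunkSize: "1mib",
  dagWidth: 1024,
  rawLeaves: true,
})); // → v1-sha256-balanced-1mib-1024w-raw
```

The upside of deriving the name from the parameters is that two profiles can never collide silently: a different parameter set always yields a different name.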
HAMT and autosharding
HAMTs are used to shard UnixFS directory blocks that contain so many links that the block would exceed a certain size.
Almost all implementations use a HAMT fanout of 256. This refers to the number of “sub-shards” (also called “ShardWidth”).
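For context on what the fanout means mechanically: with a fanout of 256, each HAMT level consumes log2(256) = 8 bits of the hashed entry name to pick one of 256 child slots. A sketch (real implementations hash names with murmur3; the helper below is purely illustrative):

```typescript
const FANOUT = 256;
const BITS_PER_LEVEL = Math.log2(FANOUT); // 8 bits consumed per level

// With 8 bits per level, each level conveniently reads one byte of the hash.
function bucketIndex(hashBytes: Uint8Array, level: number): number {
  return hashBytes[level]; // 0..255, i.e. one of FANOUT slots
}
```

This byte-alignment is part of why 256 is such a common fanout choice.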
Implementations vary in how they determine whether to use a HAMT. Some support autosharding, where they automatically shard based on an estimate of the block size (counting the size of PBNode.Links).
- Kubo/Boxo uses a size-based parameter (HAMTShardingSize) of 256KiB, where 256KiB is an estimate of the block size based on the size of all links/names. An estimate is used (rather than the actual block size) to avoid needing to serialise the Protobuf just to measure size.
- go-unixfsnode (used by go-car and extensively by the Filecoin ecosystem) also autoshards like Boxo/Kubo.
- Helia and ipfs/js-ipfs-unixfs use the same approach as Kubo (discussion, and this comment). The config option is shardSplitThresholdBytes, which defaults to 256KiB.
- ipld/js-unixfs, which the Storacha tools ipfs-car and w3up depend on, doesn’t implement autosharding (open issue). Consumers of the library like ipfs-car and w3up trigger HAMT sharding once a directory has 1000 links.
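The size-estimation heuristic that the Kubo/Boxo and Helia implementations share can be sketched roughly as follows. The field names and per-link accounting here are my assumptions for illustration, not Boxo’s actual code:

```typescript
interface DirLink {
  name: string;          // UTF-8 directory entry name
  cidByteLength: number; // byte length of the link's CID, e.g. 36 for CIDv1 sha2-256
}

// Threshold used by Kubo/Boxo (HAMTShardingSize) and Helia (shardSplitThresholdBytes).
const SHARDING_THRESHOLD = 256 * 1024; // 256KiB

// Estimate the serialised block size from names and CIDs instead of
// serialising the whole dag-pb node just to measure it.
function estimatedBlockSize(links: DirLink[]): number {
  const utf8 = new TextEncoder();
  return links.reduce(
    (sum, l) => sum + utf8.encode(l.name).length + l.cidByteLength,
    0,
  );
}

function shouldUseHAMT(links: DirLink[]): boolean {
  return estimatedBlockSize(links) > SHARDING_THRESHOLD;
}
```

The 1000-link rule used by ipfs-car and w3up is a coarser proxy for the same idea: with typical name and CID lengths, a directory crosses the size threshold somewhere in the low thousands of links, which is exactly why the two heuristics can disagree near the boundary and produce different CIDs.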
Other sources of CID divergence
Empty folders
- Kubo’s ipfs add command and Helia’s unixfs.addAll (with globSource) add empty folders.
- w3 and ipfs-car both ignore empty folders (they both depend on storacha/files-from-path which only returns files and ignores empty folders).
This means that if your input contains empty folders, the resulting CID will be different, even if all other settings are the same.
This brings up the question of why you would even include empty directories in the DAG.
My suggestion would be to account for this as we define the profile settings, i.e. explicitly specify whether empty directories are included.
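If the profile ends up with an includeEmptyDirectories flag (a hypothetical name), importers could normalise their input listing before building the DAG, mimicking what files-from-path effectively does today. A minimal sketch, assuming a flat list of paths as input:

```typescript
interface Entry {
  path: string;  // POSIX-style relative path, no trailing slash
  isDir: boolean;
}

// Hypothetical normalisation step: drop directories with nothing beneath them
// unless the profile says empty directories are part of the DAG.
function normaliseEntries(entries: Entry[], includeEmptyDirectories: boolean): Entry[] {
  if (includeEmptyDirectories) return entries;
  return entries.filter((e) => {
    if (!e.isDir) return true;
    // Keep a directory only if some other entry lives beneath it.
    return entries.some((o) => o !== e && o.path.startsWith(e.path + "/"));
  });
}
```

With this in place, both behaviours described above become a profile choice rather than an accident of which import library you happened to pick.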