Should we profile CIDs?

SethDocherty · June 27, 2025, 12:49am

@danieln, after conducting some testing on my end using test-cid-v1-wide, I’m observing a difference in CIDs when a file exceeds 1024 MB.

✅ MATCH for file: file_0_5MB (CID: bafkreie6drn3hggb3ruptdvqlahec5grhjv5m4it3sjvsm7m74us5kbofe)
✅ MATCH for file: file_1023MB (CID: bafybeigggqgfyhwr6okpc2w2v32tu7qczcurpj6j4hii6f4gxjmuot2kci)
✅ MATCH for file: file_1024MB (CID: bafybeih6ciohqseh6blceg35litzlib5we4nqudkeafbp54uu3zh7ld26a)
❌ MISMATCH for file: file_1025MB
   IPFS CID:        bafybeignp2eaklnbejnlcrxaldpiuoc63tk63vdsokleegajxpvczzxiau
   Singularity CID: bafybeia2jsxebrhwuehoptuhpmhmlxhot74nalyihzud2uufosptoakjyu

✅ MATCH for file: file_173MB (CID: bafybeibvtg6kjfyibyej47xr32bg357uh2xffelgricuxziqasdhiyp5ke)
✅ MATCH for file: file_174MB (CID: bafybeig66jfwvfifkpzodebqook26gxcvhlvkqsak35rgvxm2izbob65oy)
✅ MATCH for file: file_175MB (CID: bafybeigwbdeibl3jcugnicgahiqjvdd6f4vyk5sip4ryepk25rll5zi3l4)
✅ MATCH for file: file_1MB (CID: bafkreibksrll5wy37k5z5roemmzciocovqgi4k742a6yz75lpmsgpfukrm)

I tried experimenting with some of the other import options for the ipfs add command, but that didn’t seem to resolve the issue.

I created a Docker image that you can use to recreate it on your end.

The sample data generation configuration profile I used was dataset1, so feel free to remove the others for a quicker turnaround test.

danieln · July 1, 2025, 2:48pm

Thanks for reporting @SethDocherty.

Can you please share a link to the CAR file produced with Singularity and the CID?

SethDocherty · July 1, 2025, 4:31pm

Hi @danieln, thanks for getting back to me!

Here’s a link to a zip file containing the output CAR files from Singularity and JSON files with CIDs of the content as chunked from IPFS and Singularity.

I’m not sure if you have experience with Singularity or are familiar with the tool, but extracting content from CAR files can be somewhat complicated.

Singularity creates two output CAR files, one per piece type: data and dag.

In my experience, tools like go-car cannot extract the content from either CAR file on its own, because both CAR files are needed to extract the content.

The dag piece type represents the root CID of the content, with the corresponding IPLD content organized into the data piece type. I don’t fully understand why it was designed that way, but if you want to extract the content, you can do so by uploading the CAR files into IPFS: first add the data type, and then the dag.

hector · July 2, 2025, 10:09am

Thank you.

The difference is the following:

IPFS: bafybeignp2eaklnbejnlcrxaldpiuoc63tk63vdsokleegajxpvczzxiau
- Intermediary node 1 → 1024 links
- Intermediary node 2 → 1 link (last block)

Singularity: bafybeia2jsxebrhwuehoptuhpmhmlxhot74nalyihzud2uufosptoakjyu
- Intermediary node 1 → 1024 links
- Last block (linked directly, with no intermediary node)

Both export to the same file (by chance or not).

Our “balanced” dag builder documentation says:

// Package balanced provides methods to build balanced DAGs, which are generalistic
// DAGs in which all leaves (nodes representing chunks of data) are at the same
// distance from the root. Nodes can have only a maximum number of children; to be
// able to store more leaf data nodes balanced DAGs are extended by increasing its
// depth (and having more intermediary nodes).

In Singularity’s DAG, the last leaf node is not at the same distance from the root as the others.
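To make the depth difference concrete, here is a minimal sketch that models the two shapes listed above as nested lists and collects the depth of every leaf. It is only an illustration: the `leaf_depths` helper and the nested-list representation are made up for this post, and the shapes are taken from the description above (a root linking to two children), not read from the actual CAR files.

```python
from typing import List, Union

# A node is either a leaf (the string "leaf") or a list of child nodes.
# This representation is hypothetical and only models the tree shape.
Node = Union[str, List["Node"]]


def leaf_depths(node: Node, depth: int = 0) -> List[int]:
    """Return the depth of every leaf, in left-to-right order."""
    if node == "leaf":
        return [depth]
    depths: List[int] = []
    for child in node:
        depths.extend(leaf_depths(child, depth + 1))
    return depths


WIDTH = 1024  # links per intermediary node, as reported in this thread

# Shape reported for the IPFS CID: two intermediary nodes under the root.
ipfs_root: Node = [["leaf"] * WIDTH, ["leaf"]]

# Shape reported for the Singularity CID: one intermediary node plus a bare leaf.
singularity_root: Node = [["leaf"] * WIDTH, "leaf"]

print(sorted(set(leaf_depths(ipfs_root))))         # [2]
print(sorted(set(leaf_depths(singularity_root))))  # [1, 2]
```

Both trees hold the same 1025 leaves in the same order, which is consistent with both exporting to the same file, but only the first one keeps every leaf at the same depth.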

There’s a UnixFS Spec in the repo (specs/UNIXFS.md at main · ipfs/specs · GitHub):

The balanced layout creates a balanced tree of width ‘max width’. The tree is formed by taking up to ‘max width’ chunks from the chunk stream, and creating a unixfs file node that links to all of them. This is repeated until ‘max width’ unixfs file nodes are created, at which point a unixfs file node is created to hold all of those nodes, recursively. The root node of the resultant tree is returned as the handle to the newly imported file.

It could be worded much better but I think it matches what our implementation does:

1. Add chunks to a node until max width is reached.
2. At that point, do the same with a new node.
3. Create a unixfs node that links to “those nodes” (meaning the nodes linking to the chunks, not the chunks directly).
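The steps above can be sketched as a small function that only counts links per level. This is a simplified model under my own assumptions, not the actual builder code: the name `balanced_link_counts` is hypothetical, it ignores chunk contents and CIDs entirely, and it treats a single-chunk file as a lone leaf with no intermediary node.

```python
from typing import List


def balanced_link_counts(num_chunks: int, max_width: int) -> List[List[int]]:
    """Return link counts per level, from the level just above the leaves up to the root.

    Each inner list holds the number of links of every node on that level.
    An empty result means the file is a single leaf with no intermediary nodes.
    """
    if num_chunks < 1 or max_width < 2:
        raise ValueError("need at least one chunk and a max width of 2 or more")
    levels: List[List[int]] = []
    children = num_chunks
    while children > 1:
        # Group the nodes of the level below into parents of up to max_width links.
        full, rest = divmod(children, max_width)
        counts = [max_width] * full + ([rest] if rest else [])
        levels.append(counts)
        children = len(counts)
    return levels


print(balanced_link_counts(1024, 1024))  # [[1024]]
print(balanced_link_counts(1025, 1024))  # [[1024, 1], [2]]
```

With 1025 chunks the extra chunk gets its own intermediary node holding a single link, so the root links to two intermediary nodes rather than to one node and one bare leaf.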

Is it possible to adapt your implementation at this point?

bumblefudge · July 9, 2025, 12:04pm

I believe OP is a downstream user of Singularity, and they’ve already opened an issue on Singularity’s repo pointing to this issue, so the implementation ball is in the right court already.

Is it worth tweaking the spec language to be more explicit? Am I understanding correctly that the only divergence of interpretation was about whether link #1025 needed to be nested at the same depth as 1-1024 were? Happy to open a PR (on the PR…) for the UnixFS spec to make this paragraph more explicit if it would help future UnixFS implementers.

hector · July 9, 2025, 12:24pm

Yes, totally. The spec is not very good on this point. It even forgets completely about the “trickle” DAG, which it mentions.

The spec must have been written based on the much earlier implementation, which is way more explicit:

SethDocherty · July 9, 2025, 11:01pm

@hector, thanks again for your reply and for pointing me in the right direction! As @bumblefudge noted, I created an issue and spoke with the repo maintainer about this. I’m going to take a stab at implementing a solution.

bumblefudge · July 10, 2025, 8:45am

Actually, I went to open a PR and realized Lidel’s PR overhauling the UnixFS spec is still open, and I couldn’t quite figure out where to put the “layout” section. If that PR were merged today as-is, the whole layout section quoted above would be removed, and I didn’t see any treatment of dag-width and dag-layout in what replaces it… do we need to add it somewhere? Happy to help, but I’m a little lost on the order of operations. I guess I could make a PR on Lidel’s branch adding in the layout material as best I can?
