@danieln, after conducting some testing on my end using test-cid-v1-wide, I’m observing a difference in CIDs when a file exceeds 1024 MB.
✅ MATCH for file: file_0_5MB (CID: bafkreie6drn3hggb3ruptdvqlahec5grhjv5m4it3sjvsm7m74us5kbofe)
✅ MATCH for file: file_1023MB (CID: bafybeigggqgfyhwr6okpc2w2v32tu7qczcurpj6j4hii6f4gxjmuot2kci)
✅ MATCH for file: file_1024MB (CID: bafybeih6ciohqseh6blceg35litzlib5we4nqudkeafbp54uu3zh7ld26a)
❌ MISMATCH for file: file_1025MB
IPFS CID: bafybeignp2eaklnbejnlcrxaldpiuoc63tk63vdsokleegajxpvczzxiau
Singularity CID: bafybeia2jsxebrhwuehoptuhpmhmlxhot74nalyihzud2uufosptoakjyu
✅ MATCH for file: file_173MB (CID: bafybeibvtg6kjfyibyej47xr32bg357uh2xffelgricuxziqasdhiyp5ke)
✅ MATCH for file: file_174MB (CID: bafybeig66jfwvfifkpzodebqook26gxcvhlvkqsak35rgvxm2izbob65oy)
✅ MATCH for file: file_175MB (CID: bafybeigwbdeibl3jcugnicgahiqjvdd6f4vyk5sip4ryepk25rll5zi3l4)
✅ MATCH for file: file_1MB (CID: bafkreibksrll5wy37k5z5roemmzciocovqgi4k742a6yz75lpmsgpfukrm)
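A rough sketch of why 1025 MB could be the first size to behave differently. It assumes the profile cuts files into 1 MiB chunks and allows at most 1024 links per node, and that the test files are sized in MiB; none of those values are stated above, and the function names are made up for illustration. Under those assumptions, a 1024 MiB file still fits under a single node, while 1025 MiB is the first size that needs a second level of nodes. A result of 0 levels means the file is a single block, which would line up with the 0.5 MB and 1 MB files getting bafkrei... (raw block) CIDs.

```go
package main

import "fmt"

const MiB = int64(1) << 20

// ChunkCount returns how many fixed-size chunks a file of fileSize bytes yields.
func ChunkCount(fileSize, chunkSize int64) int64 {
	return (fileSize + chunkSize - 1) / chunkSize
}

// LevelsNeeded returns how many levels of link nodes sit above the leaves
// when every node may hold at most maxLinks links (0 means a single block).
func LevelsNeeded(chunks, maxLinks int64) int {
	if maxLinks < 2 {
		return -1 // not a meaningful width for this sketch
	}
	levels := 0
	for capacity := int64(1); capacity < chunks; capacity *= maxLinks {
		levels++
	}
	return levels
}

func main() {
	// Assumed profile values: 1 MiB chunks, at most 1024 links per node.
	for _, sizeMiB := range []int64{1, 1023, 1024, 1025} {
		chunks := ChunkCount(sizeMiB*MiB, MiB)
		fmt.Printf("%4d MiB: %4d chunks, %d level(s) of link nodes\n",
			sizeMiB, chunks, LevelsNeeded(chunks, 1024))
	}
}
```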
I tried experimenting with some of the other import options for the ipfs add command, but that didn’t seem to resolve the issue.
I created a Docker image that you can use to recreate it on your end.
Here’s a link to a zip file containing the output CAR files from Singularity and JSON files with CIDs of the content as chunked from IPFS and Singularity.
I’m not sure if you have experience with Singularity or are familiar with the tool, but extracting content from CAR files can be somewhat complicated.
In my experience, tools like go-car can’t extract the content from either CAR file on its own, because both CAR files are needed.
The dag piece type represents the root CID of the content, with the corresponding IPLD content organized into the data piece type. I don’t fully understand why it was designed that way, but if you want to extract the content, you can do so by importing the CAR files into IPFS: first the data CAR, then the dag CAR.
// Package balanced provides methods to build balanced DAGs, which are generalistic
// DAGs in which all leaves (nodes representing chunks of data) are at the same
// distance from the root. Nodes can have only a maximum number of children; to be
// able to store more leaf data nodes balanced DAGs are extended by increasing its
// depth (and having more intermediary nodes).
In Singularity’s DAG, the last leaf node is not at the same distance from the root as the others.
The balanced layout creates a balanced tree of width ‘max width’. The tree is formed by taking up to ‘max width’ chunks from the chunk stream, and creating a unixfs file node that links to all of them. This is repeated until ‘max width’ unixfs file nodes are created, at which point a unixfs file node is created to hold all of those nodes, recursively. The root node of the resultant tree is returned as the handle to the newly imported file.
It could be worded much better but I think it matches what our implementation does:
Add chunks to a node until max width is reached.
At that point, do the same with a new node.
Create a unixfs node that links to “those nodes” (meaning the nodes linking to the chunks, not the chunks directly)
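The steps above can be sketched roughly as follows. This is a simplified illustration, not the actual builder code: it works bottom-up on chunk counts only, and the Node type and BuildBalanced function are made-up names. Each pass groups the current level, left to right, into parents of at most maxWidth links, so every leaf ends up the same distance from the root.

```go
package main

import "fmt"

// Node is a hypothetical stand-in for a UnixFS file node; a node without
// links represents a chunk.
type Node struct {
	Links []*Node
}

// BuildBalanced groups nodes left to right into parents holding at most
// maxWidth links and repeats level by level until one root remains.
// It returns the root and the number of link-node levels above the chunks.
func BuildBalanced(numChunks, maxWidth int) (*Node, int) {
	if numChunks < 1 || maxWidth < 2 {
		return nil, 0
	}
	current := make([]*Node, numChunks)
	for i := range current {
		current[i] = &Node{} // one leaf per chunk
	}
	levels := 0
	for len(current) > 1 {
		var parents []*Node
		for start := 0; start < len(current); start += maxWidth {
			end := start + maxWidth
			if end > len(current) {
				end = len(current)
			}
			parents = append(parents, &Node{Links: current[start:end]})
		}
		current = parents
		levels++
	}
	return current[0], levels
}

func main() {
	for _, chunks := range []int{1024, 1025} {
		root, levels := BuildBalanced(chunks, 1024)
		fmt.Printf("%d chunks: %d level(s), root has %d link(s)\n",
			chunks, levels, len(root.Links))
	}
}
```

With a width of 1024, chunk 1025 lands under a second intermediate node instead of being linked from the root directly, which is the nesting question this thread is about.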
Is it possible to adapt your implementation at this point?
I believe OP is a downstream user of Singularity, and they’ve already opened an issue on Singularity’s repo pointing to this issue, so the implementation ball is in the right court already.
Is it worth tweaking the spec language to be more explicit? Am I understanding correctly that the only divergence of interpretation was about whether link #1025 needed to be nested at the same depth as 1-1024 were? Happy to open a PR (on the PR…) for the UnixFS spec to make this paragraph more explicit if it would help future UnixFS implementers.
@hector, thanks again for your reply and for pointing me in the right direction! As @bumblefudge noted, I created an issue and spoke with the repo maintainer about this. I’m going to take a stab at implementing a solution.
Actually, I went to open a PR and realized Lidel’s PR overhauling the UnixFS spec is still open, and I couldn’t quite figure out where to put the “layout” section. If that PR were merged today as-is, the whole layout section quoted above would be removed, and I didn’t see any treatment of dag-width and dag-layout in what replaces it… do we need to add it somewhere? Happy to help, but I’m a little lost on the order of operations. I guess I could make a PR on Lidel’s branch adding in the layout material as best I can?