IPFS for publishing research data: CAR files?

Hi, I work in the central research data office in a major German research centre. We are exploring new ways to disseminate and archive scientific data output. Currently, we run a Dataverse instance which stores the data in an object store (with S3 API). Part of this data is marked “published” and could be made available by, for instance, IPFS.

My main concerns are:

  1. Permanence. We need to make sure that data is available under the same name for at least ten years. Much longer would be greatly appreciated.
  2. Performance. The data needs to be shared with others as efficiently as possible.
  3. Non-redundancy. I don’t want to have duplicates of the (very large) data just for operating another service.

The following questions/issues arise for me. The “ipfs add --help” command says:

While not guaranteed, adding the same file/directory with the same flags will almost always result in the same output hash.

The “almost” bothers me. I really must get the very same CID many years later for the same dataset. Otherwise, I would have to put a PID service such as DOI in front of the CID.

Do I understand it correctly that a CAR file may solve this? In other words, we convert every dataset into a CAR and can guarantee the same CID for the dataset as long as we do not lose the CAR? (And IPFS keeps the ability to import our CAR version.)

Unfortunately, Dataverse cannot deal with CAR files yet but since we need a backup site anyway, I am inclined to duplicate our data as CAR files to another storage site and run an IPFS daemon there.

But this leads to the second question: there is no “ipfs dag import --nocopy” yet, correct? So we would have the data three times: once for Dataverse, once in CAR files, and once in the IPFS repository. Is there any way to avoid duplication on the IPFS site while still guaranteeing stable CIDs?


Awesome. What kind of data is it? (the topic, not the format, although that would be interesting to know as well)

Just kind of curious: where does the requirement for “same name” come from, and what exactly qualifies as being the same name? For instance, if you zipped the data into a file and the file name did not change but the URL you serve it from did, would that still qualify? I’m trying to get a better idea of what exactly the requirement is.

Similar to the last question. What is your definition of performance and/or efficiency?

I’m guessing the non-redundancy you’re referring to is for yourself, i.e. you don’t want to have two copies of the data: one for Dataverse and a second copy for IPFS.

Yes, it’s basically saying that having the file does not guarantee that you can reproduce the CID in the future by adding it a second time. In general, the CID produced is affected by the choice of hash and chunker, but sometimes there can be other changes that affect it as well. I believe there were recent changes to the default CID and directory layouts that would have caused a change. I think storing a CAR file would be safe, but then you’re saving another copy. It would be like a backup, though, so I don’t know whether that qualifies under your non-redundancy requirement.

You could always publish it as an IPNS or DNSLink record to get your “same name” requirement, and then it wouldn’t matter if the CID changed. (This is already long, so I’ll let you follow up if you have any questions about IPNS or DNSLink.)
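
For example, a minimal IPNS sketch with Kubo (the key name and <CID> are placeholders):

  # Generate a dedicated key and publish an IPNS name under it
  ipfs key gen dataset-2022
  ipfs name publish --key=dataset-2022 /ipfs/<CID>
  # The resulting /ipns/<name> stays stable even if you later
  # publish a different CID under the same key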

Just off the top of my head: I’d mount the S3 bucket with FUSE, add your content using the filestore, and then use DNSLink to point to the generated IPFS CID, avoiding IPNS.
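
Roughly like this (a sketch assuming Kubo; the mount path and domain are placeholders):

  # Enable the experimental filestore, then add without copying blocks into the repo
  ipfs config --json Experimental.FilestoreEnabled true
  ipfs add -r --nocopy /mnt/s3-bucket/published-datasets

  # DNSLink: a TXT record pointing the domain at the resulting CID
  _dnslink.data.example.org. IN TXT "dnslink=/ipfs/<CID>"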

Note that the CID is the hash of the root node in a Merkle DAG that represents your files/folders. The caveat here is that the way this DAG is built may change (i.e. a default is adjusted), so the DAG might differ if you re-add the data and rebuild it. The original CID still points to the original DAG, which still represents the original data, and that is stable (as long as it is retrievable). Also, if you really want to be able to reconstruct the original DAG later, it is good to remember how it was created the first time (right now that amounts to: balanced vs. trickle DAG, block size, CIDv1 vs. v0, raw leaves yes/no).
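
“Remembering” can simply mean spelling out every CID-relevant flag instead of relying on defaults, e.g. (a sketch; the path is a placeholder and the flag values are just one possible choice):

  # Make the add reproducible by recording the exact invocation
  ipfs add -r --cid-version 1 --raw-leaves --chunker size-262144 /data/dataset-001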

A CAR file is a serialization of the Merkle DAG, so yes.

You could consider the IPFS repository as a collection of CAR files, as you can always re-export to CAR. In that sense you would have to safeguard the IPFS storage and the pinset rather than a bunch of CAR files. Other than that, I don’t know of a way of providing CAR files directly over IPFS without importing them.
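
For completeness, the round trip looks like this (a sketch with Kubo; <rootCID> is a placeholder):

  # Export the DAG behind a root CID to a CAR file
  ipfs dag export <rootCID> > dataset.car

  # Re-import it later on any node; the root CID is preserved
  ipfs dag import dataset.car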


This seems to come up fairly often. It would be kind of cool if you could have an IPLD resource that listed all the options used to add a file, including the defaults, and allowed you to reference that CID when adding files, to support reproducible adds. Something like ipfs add --template <CID>

The data is produced by our scientists. The Research Centre Jülich has Earth and environmental science, medicine, materials science, and high-performance computing, among other disciplines.

The data will be referenced from RDF triples; therefore, we need stable URIs. And Good Scientific Practice requires data to be retrievable for at least ten years. The longer, the better.

Limited only by connection bandwidth. In my tests so far, IPFS performed much worse: I tested over a 100 Mbit/s connection, but the transfer speed was only about 1/20th of that. Moreover, the time until a CID could be found in the DHT was sometimes hours after the “ipfs add”. (Apparently, many misconfigured NATs are one cause of the latter.)

As far as speed is concerned, I certainly have to test more.

Currently, we do exactly that (using s3fs). This setup has its own problems: “ipfs filestore verify” becomes unbearably slow, and “ipfs add -r --nocopy” lacks a “follow symlinks” option that would let us rearrange things a bit.

This sounds like a good approach. Will the CID of a given DAG never change in a repo, even across IPFS version migrations?

The CID is a hash and references a unique block, so it just cannot change.


As pointed out by others in this thread, the CID is the immutable pointer to the root of the DAG for the file tree you construct and it will never change.

The reason the word “almost” is used is that, given the same input (a file tree that is added to IPFS), you can get a different CID depending on parameters that include:

  • balanced vs trickle dag
  • block size
  • cidv1 vs v0
  • raw-leaves (boolean choice that for most purposes should be true)

In other words, given the same input and the same add parameters, you will always get the same CID.
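
If you want to double-check reproducibility without writing any blocks, --only-hash recomputes the CID from the same input and flags (a sketch; the flags and path are placeholders for whatever you recorded):

  # Dry run: recompute the CID with the recorded parameters, without adding anything
  ipfs add -r --only-hash --cid-version 1 --raw-leaves --chunker size-262144 /data/dataset-001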

Once files/dirs are added to IPFS, you can always export a CAR file using the CLI, or even fetch CAR files from an IPFS gateway over HTTP.
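
For example (a sketch; the gateway URL and <rootCID> are placeholders, and this assumes a gateway that supports CAR responses):

  # Fetch a CAR for a root CID from a gateway over HTTP
  curl -H "Accept: application/vnd.ipld.car" "https://ipfs.io/ipfs/<rootCID>" -o dataset.car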

CAR files are just a serialisation format and a convenient way to move content-addressed data around.

Limited only by connection bandwidth. In my tests so far, IPFS performed much worse: I tested over a 100 Mbit/s connection, but the transfer speed was only about 1/20th of that. Moreover, the time until a CID could be found in the DHT was sometimes hours after the “ipfs add”. (Apparently, many misconfigured NATs are one cause of the latter.)

It’s worth pointing out that IPFS has two responsibilities in this context:

  • Content routing/discovery/publishing: this is the process by which the blocks/CIDs are published to the IPFS DHT and then found by other nodes in the IPFS network.
  • Content retrieval: the process by which blocks/CIDs are shared between peers using the Bitswap protocol.

Currently, content publishing is relatively slow with the default settings of Kubo (formerly known as go-ipfs). For some tips on improving this aspect, check out this blog post.
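
One knob often mentioned in this context (an assumption on my part that it fits your setup; it trades higher resource usage for much faster publishing) is the accelerated DHT client:

  # Experimental: maintain a fuller view of the DHT to speed up providing records
  ipfs config --json Experimental.AcceleratedDHTClient true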

As to the performance of content retrieval with Bitswap, it should be noted that Bitswap generally performs best when multiple nodes are providing the blocks. As an example with some concrete metrics, check out this blog post about how Netlify used Bitswap to improve their container image distribution times.

@bronger, we have a bunch of tools, services, and programs to support the onboarding of open scientific data to IPFS and Filecoin which might be relevant for you, e.g. https://estuary.tech/ and Filecoin for Large Data

Note that this is a significantly stronger statement than the one in the output of “ipfs add --help”: “adding the same file/directory with the same flags will almost always …”. It has the same preconditions as your statement, but it says “almost always” instead of “always”.

As I said, the weak performance may be a misconfiguration on our side. I have to revisit these tests. However, 1/20th of the connection bandwidth would be intolerable even if the receiver had to download from only one site.


Note that this is a significantly stronger statement than the one in the output of “ipfs add --help”: “adding the same file/directory with the same flags will almost always …”. It has the same preconditions as your statement, but it says “almost always” instead of “always”.

Indeed the docs say:

Finally, a note on hash determinism. While not guaranteed, adding the same
file/directory with the same flags will almost always result in the same output
hash. However, almost all of the flags provided by this command (other than pin,
only-hash, and progress/status related flags) will change the final hash.

@lidel
Is there anything else that could lead to a different CID given the same parameters to ipfs add?

No, I believe ipfs add --help exposes all currently existing flags that could impact the final CID.

But I agree with @bronger that ipfs add --help “almost always” is painfully vague.

Opened a PR to clarify things that impact CID determinism during ‘ipfs add’: docs: clarify CID determinism in add command by lidel · Pull Request #9128 · ipfs/kubo · GitHub. I would appreciate feedback on how we can explain this better. :pray:


Can you really say “given the same input and same add parameters, you will ALWAYS get the same CID” without also including the Kubo version? There is the possibility that some parameters get dropped in a future version, making it impossible to recreate the CID even though it would be technically feasible.

" write down and/or always explicitly set the same ‘add’ parameters" I get what you’re going for here and you’re using the term “write down” loosely here but this is why I suggested having some sort of input for “add” flags. I was thinking about it and an output for add flags would be nice as well. Call it a prototype, template, whatever. Something like ipfs add --create-template => CID of IPLD listing all flags and ipfs add --template <CID>


@bronger, in addition to storage on IPFS/Filecoin, I would love to connect with you on decentralized processing of the data in IPFS. At project Bacalhau (via Protocol Labs) we are actively seeking to help researchers improve reproducibility via low-cost open compute. Two recent research datasets on IPFS we are helping with are EUREC4A and SOCAT.

In case your dataset involves data pipelines for processing or refining the data, we would love to see if we can help augment your work. Would you be open to connecting briefly to discuss?


Currently, I am overwhelmed by such (certainly sensible) requests because we are at a very early stage of evaluation and, quite frankly, the outcome is open.

Our use case will be that we store our pinned data indefinitely ourselves, but may benefit from the BitTorrent-ness of IPFS in the first days/weeks when a certain dataset is “hot”.

Moreover – but this is a very personal view – I don’t like the PIDs currently ubiquitous in research data management (mostly handle.net-based) and consider CIDs a viable alternative.
