Hi, I work in the central research data office in a major German research centre. We are exploring new ways to disseminate and archive scientific data output. Currently, we run a Dataverse instance which stores the data in an object store (with S3 API). Part of this data is marked "published" and could be made available via, for instance, IPFS.
My main concerns are:
- Permanence. We need to make sure that data is available under the same name for at least ten years. Much longer would be greatly appreciated.
- Performance. The data needs to be shared with others as efficiently as possible.
- Non-redundancy. I don't want to have duplicates of the (very large) data just for operating another service.
The following questions/issues arise for me. The `ipfs add --help` text says:
> While not guaranteed, adding the same file/directory with the same flags will almost always result in the same output hash.
The "almost" bothers me. I really must get the very same CID for the same dataset many years later. Otherwise, I would have to put a PID service like DOI in front of the CID.
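For context, my current assumption is that we only get a reproducible CID if we pin down every `ipfs add` parameter that influences the DAG layout ourselves, instead of relying on defaults that may change between IPFS releases. A rough sketch of what I have in mind (flag values and paths are placeholders, not a recommendation):

```
# Fix all parameters that influence the resulting CID explicitly,
# instead of relying on defaults that may change between releases.
ipfs add --recursive \
    --cid-version=1 \
    --raw-leaves \
    --hash=sha2-256 \
    --chunker=size-262144 \
    /data/dataset-0001

# Later, --only-hash recomputes the CID without writing any blocks,
# so we can check that the same bytes still yield the same CID.
ipfs add --recursive --only-hash \
    --cid-version=1 --raw-leaves --hash=sha2-256 --chunker=size-262144 \
    /data/dataset-0001
```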
Do I understand it correctly that a CAR file may solve this? In other words, if we convert every dataset into a CAR file once, we can guarantee the same CID for that dataset as long as we do not lose the CAR file (and as long as IPFS keeps the ability to import our CAR version)?
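To make this concrete, the round trip I imagine would look roughly like this (the CID and file names are placeholders):

```
# Export the DAG rooted at the dataset CID into a single CAR file.
ipfs dag export <dataset-root-cid> > dataset-0001.car

# On any other node, importing the CAR reproduces exactly the same
# blocks, so the reported root CID should match the recorded one.
ipfs dag import dataset-0001.car
ipfs pin add <dataset-root-cid>
```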
Unfortunately, Dataverse cannot deal with CAR files yet, but since we need a backup site anyway, I am inclined to duplicate our data as CAR files at another storage site and run an IPFS daemon there.
But this leads to the second question: there is no `ipfs dag import --nocopy` yet, correct? So we would have the data three times: once for Dataverse, once in CAR files, and once in the IPFS repository. Is there any way to avoid duplication at the IPFS site while still guaranteeing stable CIDs?
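For comparison, the only no-copy mechanism I am aware of is the experimental filestore, which works for `ipfs add` but, as far as I can tell, not for CAR imports. Roughly (paths are placeholders):

```
# Enable the experimental filestore, so the repo stores references
# to files on disk instead of copying their blocks.
ipfs config --json Experimental.FilestoreEnabled true

# Add without copying: the repo only keeps pointers to the original
# file locations, so the data is not duplicated once more.
ipfs add --nocopy --recursive --cid-version=1 --raw-leaves /backup/dataset-0001

# Check that the referenced files are still present and unchanged.
ipfs filestore verify
```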