Hello,
I’m about to create a new ipfs-cluster for storing ~9TB of data from 3.5 million files. Previously I was running an IPFS node and adding new files from a local dir, but I started having problems keeping it updated over time (the collection is growing by ~1k files a day, ~6GB). So this time I wanted to describe the problem here first and ask for some advice on how to maintain it properly =]
More details:
- Since it’s a lot of files, what is the best datastore to use in this case? BadgerDS? Before, I was using `ipfs add --pin --nocopy --fscache {file}` to pin new datasets and avoid creating two copies of the data, but it was a bit annoying to manage when I had to move data around (like when I needed larger HDs for storage).
- The initial node in the cluster has 16GB of RAM, a 256GB SSD, and 16TB of HDD (2x8TB disks). I was planning to store the blocks on the HDDs but keep other metadata on the SSD. Is this enough / does it make sense?
- Use `crdt` as consensus, since additional nodes will be followers and only the initial node will pin more data to the cluster.
- What options should I use with `ipfs add` to have good future compatibility? (which hash algorithm, CIDv1, …)
- The dir structure looks like this:
  ```
  ├── wort-genomes
  │   └── <1.16M files>
  ├── wort-img
  │   └── <65k files>
  └── wort-sra
      └── <2.3M files>
  ```
  Mirroring this structure in IPFS is useful because it is easy to point IPNS to the root, and then accessing a different file can be done by changing `wort-sra/ERR4020100.sig` to `wort-sra/SRR11555563`. But… it never really worked, because that `wort-sra/` dir has way too many files beneath it and just never finishes loading. I was considering not building these dirs for now (and using another API to provide the mapping from dataset ID to IPFS hash), but is there any solution to this that doesn’t involve splitting the directory into smaller dirs (the way git shards its objects, for example)? Is IPLD a viable solution for building these dirs with millions of children?
Thanks =]
P.S.: this cluster will support the storage and distribution of the data from https://wort.oxli.org/, a system for indexing public genomic databases to allow large-scale sequencing data search and comparison. I wrote more about it in chapter 5 of my dissertation.