Distributed filesystem with 50TB of data in IPFS. Doable?

Hi,
This looks like an amazing project. Before I run down the rabbit hole, can I get some feedback on what I am trying to do and whether IPFS is suitable?

I have a server with 50TB of data, made up of very large files. I want to share a subset of those files, say 1.5TB, with other locations. Each week I will be adding to what needs to be shared and updating what needs to be at each client.

Could I add the 50TB to an IPFS file system and then, on the clients, pin the files I want at each location? The P2P nature of the system would then effectively torrent the files needed at each location.
Am I understanding this correctly as a capability of the implementation?

Is it wise to share a 50TB sized file system?

Can I monitor the status of how much needs to be transferred and how long for each client?

Can I have a virtual file system that only shows the pinned files on the client side?

Thanks,
James


Surprised no one seems to have a comment on this…

Can someone at least tell me whether the idea of deciding which files to distribute to which node by pinning them is a good one?

Thanks,
James

Hi James,

Maybe you want to check out ipfs-cluster.

I’ll take a shot at answering these.

You can do this using IPNS by updating the IPNS address with the newest hash each week. However, you might need to set up a script on the clients to pin the new hash that IPNS points to each week, and optionally unpin older hashes if you don’t want those files hanging around.
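As a rough sketch (not something I've tested), a client-side script along these lines could run weekly from cron; the IPNS name and file paths are placeholders, and keeping the previous hash in a file is just one way to handle cleanup:

    #!/bin/sh
    # hypothetical weekly sync script run on each client
    # <publisher-ipns-name> is a placeholder for the server's IPNS name

    NEW_HASH=$(ipfs name resolve -r /ipns/<publisher-ipns-name>)

    # pin the new snapshot; only blocks not already held locally get fetched
    ipfs pin add --progress "$NEW_HASH"

    # optionally unpin last week's snapshot and reclaim the space
    if [ -f "$HOME/last_hash" ]; then
        ipfs pin rm "$(cat "$HOME/last_hash")"
        ipfs repo gc
    fi
    echo "$NEW_HASH" > "$HOME/last_hash"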

That’s correct.

You might currently have to do some fine-tuning if trying to pin 1.5TB at a time, but I’m under the impression that some performance fixes are coming in v0.4.11. If only a small subset of files are changing each week, adding the data to IPFS should be much faster after the initial add. By default, I’d expect you to see excessive bandwidth utilization while adding and pinning very large datasets.

You will probably also want to use the IPFS filestore to add the files into IPFS, since by default any added content gets copied into IPFS’ managed datastore; if you don’t use the filestore, you’d consume roughly an additional 1.5TB of local storage (maybe less, depending on whether there are duplicate blocks) for every 1.5TB of content added.
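For reference, the filestore is an experimental feature; in recent go-ipfs versions enabling it and adding data in place looks roughly like this (the /data/50TB path is just a placeholder):

    # enable the experimental filestore, then restart the daemon
    ipfs config --json Experimental.FilestoreEnabled true

    # add data in place; blocks reference the original files instead of being copied into the repo
    ipfs add -r --nocopy /data/50TB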

There might be a way to get some information on transfer progress, but I’m not aware of anything currently that would be similar to the visualizations available in torrent clients or P2P file transfer tools like Resilio Sync.
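The closest built-in options I know of are fairly coarse, node-level numbers rather than a per-transfer dashboard, e.g.:

    # current bandwidth totals and rates for this node
    ipfs stats bw

    # rough progress indicator (nodes fetched so far) while pinning a given hash
    ipfs pin add --progress <hash>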

In general, not that I’m aware of. IPFS doesn’t treat locally cached content any differently from content that is only available remotely. For any given IPFS hash you can interact with it as if it’s in the local filesystem by setting up an IPFS mount and navigating to /ipfs/<hash>. While it might seem like it should work, doing something like ls /ipfs/ on the IPFS mount will not list only the pinned or locally cached hashes.
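If the goal is just to see what a client is holding locally, the pin and refs commands are the closest equivalents I’m aware of:

    # list the root hashes that are recursively pinned on this node
    ipfs pin ls --type=recursive

    # list every block currently stored in the local repo
    ipfs refs local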


These instructions on Replicating Large Amounts of Data with Minimal Overhead might be useful. I think there are newer, cleaner ways to do some of the optimizations mentioned in those instructions, but the instructions are still valid. Note: those instructions recommend using ipfs-pack to publish the data, which might not work for your use case, since pack was originally designed for datasets that remain static over time (i.e. archiving a snapshot of a government dataset). Instead of ipfs-pack, as @leerspace mentioned, the IPFS filestore might help here; it has an experimental --nocopy option for ipfs add that lets you index data in place rather than copying it into the ipfs repo.

Disk write times will be one of your main bottlenecks when adding that volume of data.

@whyrusleeping do you have any advice to add?


For datasets that large, I’d recommend using the filestore (as @leerspace mentioned above) and disabling reproviding with ipfs config Reprovider.Interval "0". Then do the add without the daemon running, something like: ipfs add -r --nocopy mydataset.

There are some more experimental things you could try, for example, using blake2b instead of sha256 as the hash function by adding --hash=blake2b-256 to the add call.
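Putting those pieces together, a publish run on the server could look something like the following; the dataset path is a placeholder and the blake2b flag is optional:

    # one-time setup: enable the filestore and disable reproviding
    ipfs config --json Experimental.FilestoreEnabled true
    ipfs config Reprovider.Interval "0"

    # with the daemon stopped, add the dataset in place
    ipfs add -r --nocopy --hash=blake2b-256 /data/50TB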


Thanks for everyone's suggestions. Interesting stuff.
I've been reading up on the suggested additions.
There does not appear to be an easy way to monitor the current synchronization state of the nodes,
i.e. how much data still needs to transfer to achieve sync, the transfer speed, and the expected timeframe. I suppose it's not really designed for that.
But I feel I am at a stage where I have to set up a test platform and play around.

Thanks,
James

@jamiegau any update on the success of your use case?