Distributed filesystem with 50TB of data in IPFS. Doable?

Hi,
This looks like an amazing project. Before I run down the rabbit hole, can I get some feedback on what I am trying to do and whether IPFS is suitable?

I have a server with 50TB of data, made up of very large files. I want to share a subset of those files, say 1.5TB, with other locations. Each week I will be adding to what needs to be shared and updating what needs to be at each client.

Could I add the 50TB to an IPFS file system and then, on the clients, pin the files I want at each location? The P2P nature of the system would then effectively torrent the files needed at each location.
Am I understanding this correctly as a capability of the implementation?

Is it wise to share a 50TB sized file system?

Can I monitor the status of how much needs to be transferred and how long for each client?

Can I have a virtual file system that only shows the pinned files on the client side?

Thanks,
James


Surprised no one seems to have a comment on this…

Can someone at least tell me whether the idea of deciding which files to distribute to which node by pinning them is a good one?

Thanks,
James

Hi James,

Maybe you want to check out ipfs-cluster.

I’ll take a shot at answering these.

You can do this using IPNS by updating the IPNS address with the newest hash each week. However, you might need to set up a script on the clients to pin the new hash that IPNS points to each week, and optionally unpin older hashes if you don’t want those files hanging around.
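As a rough sketch (not something I've tested), a client-side script along these lines could run weekly from cron; the IPNS name and file paths are placeholders, and keeping the previous hash in a file is just one way to handle cleanup:

    #!/bin/sh
    # hypothetical weekly sync script run on each client
    # <publisher-ipns-name> is a placeholder for the server's IPNS name

    NEW_HASH=$(ipfs name resolve -r /ipns/<publisher-ipns-name>)

    # pin the new snapshot; only blocks not already held locally get fetched
    ipfs pin add --progress "$NEW_HASH"

    # optionally unpin last week's snapshot and reclaim the space
    if [ -f "$HOME/last_hash" ]; then
        ipfs pin rm "$(cat "$HOME/last_hash")"
        ipfs repo gc
    fi
    echo "$NEW_HASH" > "$HOME/last_hash"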

That’s correct.

You might currently have to do some fine-tuning if trying to pin 1.5TB at a time, but I’m under the impression that some performance fixes are coming in v0.4.11. If only a small subset of files are changing each week, adding the data to IPFS should be much faster after the initial add. By default, I’d expect you to see excessive bandwidth utilization while adding and pinning very large datasets.

You will probably also want to use the IPFS filestore to add the files into IPFS, since by default any added content gets copied into IPFS’ managed datastore; if you don’t use the filestore, you’d consume roughly an additional 1.5TB of local storage (maybe less, depending on whether there are duplicate blocks) for every 1.5TB of content added.
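For reference, the filestore is an experimental feature; in recent go-ipfs versions enabling it and adding data in place looks roughly like this (the /data/50TB path is just a placeholder):

    # enable the experimental filestore, then restart the daemon
    ipfs config --json Experimental.FilestoreEnabled true

    # add data in place; blocks reference the original files instead of being copied into the repo
    ipfs add -r --nocopy /data/50TB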

There might be a way to get some information on transfer progress, but I’m not aware of anything currently that would be similar to the visualizations available in torrent clients or P2P file transfer tools like Resilio Sync.
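The closest built-in options I know of are fairly coarse, node-level numbers rather than a per-transfer dashboard, e.g.:

    # current bandwidth totals and rates for this node
    ipfs stats bw

    # rough progress indicator (nodes fetched so far) while pinning a given hash
    ipfs pin add --progress <hash>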

In general, not that I’m aware of. IPFS doesn’t treat locally cached content any differently from content that is only available remotely. For any given IPFS hash you can interact with it as if it’s in the local filesystem by setting up an IPFS mount and navigating to /ipfs/<hash>. While it might seem like it should work, doing something like ls /ipfs/ on the IPFS mount will not list only the pinned or locally cached hashes.
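If the goal is just to see what a client is holding locally, the pin and refs commands are the closest equivalents I’m aware of:

    # list the root hashes that are recursively pinned on this node
    ipfs pin ls --type=recursive

    # list every block currently stored in the local repo
    ipfs refs local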


These instructions on Replicating Large Amounts of Data with Minimal Overhead might be useful. I think there are newer, cleaner ways to do some of the optimizations mentioned in those instructions, but the instructions are still valid. Note: those instructions recommend using ipfs-pack to publish the data, which might not work for your use case, since pack was originally designed for datasets that remain static over time (i.e. archiving a snapshot of a government dataset). Instead of ipfs-pack, as @leerspace mentioned, the IPFS filestore might help here; it has an experimental --nocopy option for ipfs add that lets you index data in place rather than copying it into the ipfs repo.

Disk write times will be one of your main bottlenecks when adding that volume of data.

@whyrusleeping do you have any advice to add?


For datasets that large, I’d recommend using the filestore (as @leerspace mentioned above) and disabling reproviding with ipfs config Reprovider.Interval "0". Then do the add without the daemon running, something like: ipfs add -r --nocopy mydataset.

There are some more experimental things you could try, for example, using blake2b instead of sha256 as the hash function by adding --hash=blake2b-256 to the add call.
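Putting those pieces together, a publish run on the server could look something like the following; the dataset path is a placeholder and the blake2b flag is optional:

    # one-time setup: enable the filestore and disable reproviding
    ipfs config --json Experimental.FilestoreEnabled true
    ipfs config Reprovider.Interval "0"

    # with the daemon stopped, add the dataset in place
    ipfs add -r --nocopy --hash=blake2b-256 /data/50TB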


Thanks for everyone's suggestions. Interesting stuff.
I've been reading up on the suggested additions.
There does not appear to be an easy way to monitor the current synchronization state of the nodes,
i.e. how much data still needs to transfer to achieve sync, the transfer speed, and the expected timeframe. I suppose it's not really designed for that.
But I feel I am at a stage where I have to set up a test platform and play around.

Thanks,
James

@jamiegau any update on the success of your use case?