Private IPFS with > 200GB Files

Hi all,

I have spent the last few weeks assessing IPFS for exchanging large data (> 200 GB) on a private swarm. The main goal is to make these large files available to a small cluster of machines. I managed to set up the swarm and share the data over IPNS. However, the access times fell somewhat short of my expectations. I used kubo 0.19.0.

The setup consisted of two machines (A & B) connected by a 1 Gbit/s network. I added the file/directory (300 GB) on machine A to IPFS, published it through IPNS, and accessed the respective key on machine B.
As a baseline for my measurements I used a direct scp transfer of the file/directory in question. I repeated the measurements to average out run-to-run differences, but the variance in timings was very small. I also tried different chunk sizes for IPFS’s block storage.
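For context, the workflow looked roughly like this (paths, CIDs and the peer ID below are placeholders, not the actual values):

```
# On machine A: add the directory with an explicit chunk size
# (size-262144 is the 256 kB default; size-1048576 gives the 1 MB blocks)
ipfs add -r --chunker=size-1048576 /data/large-dir

# Publish the resulting root CID under the node's IPNS key
ipfs name publish /ipfs/<rootCID>

# On machine B: resolve the IPNS name and fetch the content
ipfs get /ipns/<peerID> -o /data/large-dir
```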

My goal is to access the contents of the file/directory from other software that is not aware of the IPFS API, so I’m a bit stuck with “ipfs get”, as I never managed to mount IPFS reliably and access it like any other filesystem. Maybe I did not understand all of the options IPFS gives me here; if you have advice on how to do this better, please tell me.
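For completeness, the standard FUSE mount route, as far as I understand it, looks like this (the mount points are just examples); maybe someone can point out what I’m missing:

```
# The FUSE mount points must exist and be owned by the user running the daemon
sudo mkdir -p /ipfs /ipns
sudo chown "$USER" /ipfs /ipns

# Optionally point kubo at different mount points
ipfs config Mounts.IPFS /ipfs
ipfs config Mounts.IPNS /ipns

# Start the daemon with FUSE mounts enabled...
ipfs daemon --mount
# ...or mount while a daemon is already running
ipfs mount
```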

Here are my findings in comparison to the scp baseline (the commands behind the measurements are sketched after the list):

  • Fetching the file with “ipfs get” on machine B took 2.79 times as long as scp (6h 51m vs 2h 27m) with 256 kB blocks, and 1.55 times as long (3h 48m vs 2h 27m) with 1 MB blocks.
  • I assumed most of the time was spent transferring the hashed blocks, so I ran “ipfs get” a second time: with 256 kB blocks it still took 0.75 times the scp time (1h 51m vs 2h 27m), and with 1 MB blocks 0.72 times the scp time (1h 47m vs 2h 27m).
  • I found that one can use “ipfs pin add” to transfer the blocks from machine A to machine B, but this took almost the same time as scp (2h 26m vs 2h 27m), and I would still need to call “ipfs get” afterwards to access the contents of the directory, leaving me with an additional ~1h 50m.
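For anyone who wants to reproduce this, the measurements amount to timing the respective commands, e.g. (the CID is a placeholder):

```
# First fetch on machine B: blocks come over the network from machine A
time ipfs get <rootCID> -o /data/run1

# Second fetch: blocks are already in the local blockstore,
# so this mostly measures reassembling the files from local blocks
time ipfs get <rootCID> -o /data/run2

# Pin-only transfer: fetches and stores the blocks without writing out files
time ipfs pin add --progress <rootCID>
```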

Is there something I could do more effectively here? How would I make IPFS work efficiently with large files?

Assuming my approach is the expected way to interact with IPFS in a private swarm, I would expect that accessing the data after caching (i.e. the second time) would be much faster. I understand that maintaining the Merkle DAG doesn’t come for free, but I would expect that reconstructing already-downloaded data should take minutes rather than hours. Is there something I forgot to configure?
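One check I could still do (just a sketch, I have not run it on the 300 GB set): stopping the daemon and running the second fetch in offline mode, so it can only be served from the local blockstore and any remaining slowness is clearly local reassembly rather than network traffic:

```
# With the daemon stopped, --offline guarantees no network access;
# the data is rebuilt purely from blocks already in the local repo
ipfs --offline get <rootCID> -o /data/offline-run
```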

Best and thank you.


This is interesting; I wonder what the best way to optimize this is. I believe changing the datastore should affect performance quite a bit. Nailing this down would be useful, it’s a very reasonable and common use case, and putting a guide for it in the IPFS docs / Kubo README could be worthwhile :thinking:.
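For example (untested at this scale), a fresh repo can be initialised with the Badger datastore instead of the default flatfs:

```
# New repo using the badgerds profile instead of the default flatfs
IPFS_PATH=/data/ipfs-badger ipfs init --profile=badgerds

# An existing repo's config can be switched with
#   ipfs config profile apply badgerds
# but the already-stored blocks then need migrating with ipfs-ds-convert
```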

Thank you for your reply. I will have a look into that!

Are you aware of IPFS Cluster? It might or might not be what you need depending on your use case.
It is also developed by Protocol Labs.
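Very roughly, once a cluster is running, replicating the data to the peers looks like this (cluster setup itself omitted; the CID is a placeholder):

```
# Add a directory through the cluster; it gets pinned on peers
# according to the configured replication factor
ipfs-cluster-ctl add -r /data/large-dir

# Or pin an already-added CID across the cluster
ipfs-cluster-ctl pin add <rootCID>

# Check the pin/replication status on each peer
ipfs-cluster-ctl status <rootCID>
```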