How to tune a private IPFS swarm for large files?

I’ve been playing with IPFS private swarms as a background activity at my work for a few months, and am really excited about how we might use it. I’d like some help understanding its performance, and if there’s anything I can do to tune it.

Use case

My use-case is transferring largish files (~6-10 GiB) built in cloud compute to our on-premise lab equipment. I estimate that a naive implementation (i.e. without IPFS) will transfer ~175 TiB/day of data. To put that in perspective, that would be a sustained rate of about 17.8 Gbps.

These large files are bootable disk images, such as the contents of a Linux system running Debian. On a given day most of the disk images will be very similar to one another (perhaps 98% duplicate data between any two given disk images).

I’m excited to use IPFS as it will allow me to have my cake and eat it: any single user of the on-premise lab equipment can treat it as private by adding their just-compiled disk image to the swarm and instructing the equipment to boot from a given CID. The IPFS swarm will deduplicate the user’s disk image and likely only transfer the unique blocks across the WAN link because the duplicate blocks are likely already available in the lab - either on the equipment itself from a previous job, or from peers in the same racks which ran a related job.

I expect to have a single peer in the cloud and 100+ peers initially in the lab, one per item of equipment. I hope to grow this to all our lab equipment, so perhaps 600-1000 peers.

Performance

So far my experiments suggest IPFS cannot transfer data between two peers faster than ~250 Mbps (25 MiB/s), which does not even remotely saturate our 10+ Gbps network links. I have not yet tried a scale test with a large number of peers. Should I expect that to be quicker?

Is there anything I can tune to increase throughput? Or to reduce CPU load, if that is indeed my bottleneck? I benchmarked multihash performance and settled on blake3, as it’s the fastest on the embedded systems we’re using (which lack hardware acceleration for SHA-256).

The swarm is entirely private, and I find myself wondering if the default (256 KiB) and maximum (1 MiB) chunk sizes are too small to allow network flow control systems to reach full speed - perhaps due to (e.g.) TCP window sizing? I’ve tried using a 1 MiB chunk size and it didn’t make a measurable difference.
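
For concreteness, this is how I set the larger chunk size (Kubo’s ipfs add; the size value is in bytes, and disk-image.img is just a placeholder name):

    # default is --chunker=size-262144 (256 KiB); this forces 1 MiB fixed-size chunks
    ipfs add --chunker=size-1048576 disk-image.img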

I also wonder if enabling more concurrent traffic might help, i.e. larger want lists.
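
For the record, the knobs I’ve found so far are Kubo’s Internal.Bitswap settings (field names from Kubo’s config docs; the values below are just what I’m experimenting with, not recommendations). A daemon restart is needed for them to take effect:

    # raise bitswap engine concurrency (the defaults are fairly conservative)
    ipfs config --json Internal.Bitswap.TaskWorkerCount 16
    ipfs config --json Internal.Bitswap.EngineTaskWorkerCount 16
    ipfs config --json Internal.Bitswap.EngineBlockstoreWorkerCount 256
    # allow more data in flight per peer (value is in bytes)
    ipfs config --json Internal.Bitswap.MaxOutstandingBytesPerPeer 4194304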

Any tips or ideas for experiments to try would be appreciated!


The data-transfer issues are due to bugs in the go-bitswap implementation.
Given that both of your nodes are trusted and fast, a single peer protocol such as graphsync or CAR files over HTTP would be much better.
You could try my software: GitHub - Jorropo/linux2ipfs: Small pipeline and extreme-performance oriented IPFS implementation to upload files and deltas to pinning services very fast. It would be fairly easy to add blake3 support.

It’s a bit buggy and doesn’t properly parallelise small files but is still much faster than Kubo, Iroh or whatever else.

I missed this: if your disk images are sparse, software that does SEEK_HOLE and SEEK_DATA would help a lot.

Yes, they are. IPFS deduplicates the empty blocks, which (so far) has felt “good enough” for my purposes.

Yeah … it could be much better: with the default 256 KiB chunk size, a hole can need up to 512 KiB of contiguous zeros before it fully covers an aligned chunk and actually deduplicates.
Linux2ipfs doesn’t deduplicate anything yet, so it might actually be worse at this.

Given that both of your nodes are trusted and fast, a single peer protocol such as graphsync or CAR files over HTTP would be much better

The attraction of IPFS is that it would autonomously manage data locality, i.e. pulling from peers in the lab when possible, and degrading gracefully to a more-or-less straight push from the data source to destination. Which is to say that the transport layer always does the right thing, and users of the lab equipment don’t need to micro-manage data caches.

You could try my software: GitHub - Jorropo/linux2ipfs: Small pipeline and extreme-performance oriented IPFS implementation to upload files and deltas to pinning services very fast. It would be fairly easy to add blake3 support.

Interesting, thanks!

This appears to be a very fast way of chunking files into CAR format, right? It doesn’t transfer the data between two or more peers in a swarm?


Welcome @meermanr! Nice to have you here.

As @jorropo mentions, bitswap data transfer has been slower than many IPFS users need. The good news, though, is @bFive and team are working on a new higher-throughput data transfer protocol to be released in ~January. I’ll let him share details.


:wave: @meermanr, super interesting project!

As others have mentioned, I work on Iroh. A bunch of us are actively working on the problem of transfer speeds, so we should have a better solution to your problem. I showed your post to the other Iroh maintainers, and this is the exact kind of thing we’re hoping to fix.

However, I’d like to do a little expectation management on your use case, specifically with regard to that fancy 10+ Gbps connection you have, and the block de-duplication property you’re after:

In practice, I think you’ll be forced to choose between de-duplication and transfer speeds.

Why the tradeoff? When you cut up a file, put it into blocks, and store those blocks on a hard drive addressed by hash, reading those blocks back turns into effectively random seeks across your disk, which bottlenecks your capacity to saturate that fancy internet tube. Even with solid state drives, memory mapping, or a database, all the hopping around a merkle tree comes with a cost, and there’s a very real chance we can’t read fast enough to saturate a 10 Gbps connection. I haven’t had the chance to work out the numbers on physical feasibility, but it’s safe to say saturating a 10 Gbps connection will require structural changes to the way IPFS works.
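
(Back of the envelope: 10 Gbps is about 1.2 GiB/s, so at the default 256 KiB block size that’s on the order of 4,800 block reads per second, each of which is potentially a random seek plus a hash lookup.)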

So, while we’re at it, we should talk about deduplication, and check to make sure it’s actually real. In our tests, the amount of internal de-duplication in common IPFS data (UnixFS DAGs, to be specific) we’ve found in the wild is negligible, on the order of less than 5% of the total content. Your example is looking at de-duplication across two different graphs. If those graphs are UnixFS graphs of filesystem data, you’ll surely get massive amounts of de-duplication whenever files or directories exactly match. But if you’re putting 2 slightly different ISO images into IPFS, they will be treated as large byte streams, and for that you’d definitely want to look into the rabin chunker if you’re trying to maximize de-dupe.
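
If you want to experiment, Kubo exposes this through the --chunker flag (the rabin parameters are min/avg/max chunk sizes in bytes; the values here are just a starting point, not a recommendation):

    # content-defined chunking: boundaries follow the data, so an insertion early
    # in the image doesn't shift every later chunk the way size-262144 does
    ipfs add --chunker=rabin-131072-262144-524288 disk-image.img

    # buzhash is a cheaper content-defined chunker with fixed internal parameters
    ipfs add --chunker=buzhash disk-image.img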

I still think you can have your cake and eat it too. A swarm with faster transfer & smarter caching should make up for a lack of de-duplication, but we’re going to have to break a bunch of stuff first :smile:.

Ps: we too are fans of blake3, like, big fans. More on that soon.


Bitswap, the only data transfer implementation that fully supports automatic data locality, is not fast.

I want to fix that (RAPIDE - Jorropo - YouTube), but this doesn’t exist yet. Soon™.

We also have a data transfer working group that was kicked off at IPFS Camp 2022: https://www.youtube.com/playlist?list=PLuhRWgmPaHtQ--aQ5GlgCyKkQYXUfQpVf

Currently, you have various other solutions which are faster, with more or less magic happening (graphsync, …).
The fastest one is to send .car files over HTTP to a cluster of servers.
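
Roughly like this on the producer side (ipfs dag export is a real Kubo command; the ingest URL below is hypothetical - substitute whatever your cluster exposes):

    # pack the whole DAG behind a root CID into a single .car file
    ipfs dag export "$ROOT_CID" > disk-image.car

    # push it over plain HTTP to whatever service ingests CARs on the lab side
    # (hypothetical URL - e.g. an ipfs-cluster proxy or a small custom receiver)
    curl -X POST --data-binary @disk-image.car http://lab-ingest.example/car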

The only data transfer it knows how to do is to stream the chunking output over HTTP POST requests.

I brought this up because I guess you are using ipfs add, and just ipfs adding data is not fast (although it’s orders of magnitude faster than the go-bitswap client & server shipping in Kubo), so I guessed that ipfs adding your files would take a while.

The only data transfer it knows how to do is to stream the chunking output over HTTP POST requests.

I’d like to avoid any explicit network addresses - the magic of IPFS is that once the producers (cloud) and consumers (lab equipment) are members of the swarm I don’t need to micro-manage network addresses - I need only concern myself with the content (CIDs).

So I am trying to avoid pushing the data directly using (e.g.) rsync or S3, because then I need to explicitly manage where data resides. For example, if I used S3 over HTTP, I could use a caching proxy in the lab to accelerate data transfer, but I want to avoid that for a couple of reasons:

  1. The throughput requirements of this caching proxy server would make it quite expensive - perhaps a storage cluster in its own right.
  2. The threat model and security implications of caching data from multiple projects on a shared system would place burdens on my team, which I think we can avoid.

As regards the performance of centrally provisioned storage: our other (older) lab has a central NFS server providing for ~400 devices, none of which have any locally attached storage. Since this is all on-premise, adding more capacity or performance takes months and significant expense.

For the new lab I want the opposite: every device has locally attached storage, and no central shared storage. If we use peer-to-peer file sharing between devices which are assigned to the same project, and I find a way to deduplicate the data over both space and time[1], then I think I can avoid having to manage shared storage systems.

Specifically, I imagine running multiple private IPFS swarms, one per confidentiality domain (i.e. project). The IPFS peers (devices in the lab) would be moved between swarms regularly to essentially time share them between the various projects,[2] and avoid my team having any responsibility for data at rest.
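
For reference, the per-project isolation is just a libp2p private network: each project gets its own swarm.key, and a node can only be in one private swarm at a time. A minimal sketch, assuming Kubo:

    # generate the project's pre-shared key
    # (format: /key/swarm/psk/1.0.0/, /base16/, then 64 hex characters)
    printf '/key/swarm/psk/1.0.0/\n/base16/\n%s\n' \
      "$(od -vN32 -An -tx1 /dev/urandom | tr -d ' \n')" \
      > "${IPFS_PATH:-$HOME/.ipfs}/swarm.key"

    # refuse to start unless a private-network key is present
    export LIBP2P_FORCE_PNET=1
    ipfs daemon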

I brought this up because I guess you are using ipfs add, and just ipfs adding data is not fast (although it’s orders of magnitude faster than the go-bitswap client & server shipping in Kubo), so I guessed that ipfs adding your files would take a while.

Could I use this to quickly create a CAR file, and then import it with ipfs dag import?
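
I’m imagining something like this on the receiving side (ipfs dag import is a real Kubo command; whether linux2ipfs can write the .car locally rather than uploading it is the part I haven’t checked):

    # import every block from the CAR and pin its root so GC keeps it
    ipfs dag import --pin-roots=true disk-image.car

    # sanity check: walk the DAG and report its total size and block count
    ipfs dag stat "$ROOT_CID"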

As you can see in my explanation above, I expect to seed a swarm from a single producer, but I expect a lot of fan-out and bit-swapping between the peers in the lab (who may well want the same data if the user is running a parallel regression test, etc.).


  1. Deduplication over space: one producer with a fan-out to many consumers, e.g. a test campaign. Deduplication over time: new content being 98% identical to previous content, e.g. edit-compile-test cycles of an individual developer. ↩︎

  2. We deep-clean all persistent storage on a given device when reassigning it. The host running IPFS runs from a RAM disk (i.e. no changes are persisted between reboots) and the locally-attached storage is encrypted using LUKS on every boot. So a reboot discards the decryption key. ↩︎

That looks like exactly what I am after! Thank you for sharing it.

It reminds me of BitTorrent’s super-seed feature, where the sole provider of some unique content will only send out chunks it has not previously sent (and then removes those chunks from its have list). This is the sort of behaviour that would be perfect for my use-case. (And I think you touched on this in the data transfer working group - thank you for linking me to that!)


I was hoping to tune my workflows to maximise naive deduplication in IPFS. For example, if every disk image was derived from the same base image by mounting it, modifying a couple of files, unmounting it and then adding it to IPFS, I would expect that 98% of the disk image would be as it was before being mounted.

So I expect that a fixed-size chunker, such as the default --chunker size-262144, would come to the same conclusion: 98% of the original and modified disk images are unaltered - the same data at exactly the same offsets.

Therefore when the user transfers their modified disk image to the lab, I would expect that only 2% of the disk image is actually transmitted by that user since the other 98% of the data is already floating around the lab.
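
One experiment I plan to run to check this assumption: add both images, then compare their block CID sets - with the fixed-size chunker the overlap should be close to 98%. Roughly:

    # list every block CID reachable from each image's root
    ipfs refs -r --unique "$BASE_CID"     | sort > base.refs
    ipfs refs -r --unique "$MODIFIED_CID" | sort > modified.refs

    # blocks shared with the base image vs. blocks unique to the modified image
    comm -12 base.refs modified.refs | wc -l
    comm -13 base.refs modified.refs | wc -l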

I have been wondering if my team should write a custom chunker to exploit the structure of the disk images - GUID partition tables and exFAT / ext4 file-systems mostly. I don’t think we can use UnixFS because it doesn’t encode extended attributes, which we need to satisfy SELinux boot conditions.


The data-transfer issues are due to bugs in the go-bitswap implementation.

I’ve been using bitswap in the recent past to transfer data across multiple clusters of fewer than a thousand nodes, and have been plagued by this issue as well.

What specifically is being implemented incorrectly that causes the rate to be low? I’ve looked into two possible issues and I’d like to help fix this - would you mind expanding on it?