I’ve been playing with IPFS private swarms as a background activity at my work for a few months, and am really excited about how we might use it. I’d like some help understanding its performance, and if there’s anything I can do to tune it.
Use case
My use case is transferring largish files (~6-10 GiB) built in cloud compute to our on-premise lab equipment. I estimate that a naive implementation (i.e. without IPFS) would transfer ~175 TiB/day of data. To put that in perspective, that's a sustained rate of about 17.8 Gbps.
These large files are bootable disk images, such as the contents of a Linux system running Debian. On a given day most of the disk images will be very similar to one another (perhaps 98% duplicate data between any two given disk images).
I’m excited to use IPFS because it lets me have my cake and eat it: any single user of the on-premise lab equipment can treat the system as private, simply adding their just-compiled disk image to the swarm and instructing the equipment to boot from a given CID. The IPFS swarm will deduplicate the disk image and should only need to transfer the unique blocks across the WAN link, because the duplicate blocks are most likely already available in the lab: either on the equipment itself from a previous job, or on peers in the same rack that ran a related job.
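To illustrate the kind of overlap I'm counting on, here's a back-of-envelope check: split two images into fixed-size chunks, hash each chunk, and count how many chunks the second image shares with the first. This is only a sketch of the principle (plain SHA-256 over fixed-size chunks), not the actual IPFS chunker or CID computation:

```go
// chunkoverlap: rough estimate of how many fixed-size chunks two disk
// images share. Illustrates the dedup principle only, not real IPFS.
package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"os"
)

const chunkSize = 256 << 10 // 256 KiB, matching the IPFS default chunk size

// chunkHashes returns the set of chunk digests for a file.
func chunkHashes(path string) (map[[32]byte]bool, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	seen := make(map[[32]byte]bool)
	buf := make([]byte, chunkSize)
	for {
		n, err := io.ReadFull(f, buf)
		if n > 0 {
			seen[sha256.Sum256(buf[:n])] = true
		}
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			break
		}
		if err != nil {
			return nil, err
		}
	}
	return seen, nil
}

func main() {
	if len(os.Args) != 3 {
		fmt.Fprintln(os.Stderr, "usage: chunkoverlap <imageA> <imageB>")
		os.Exit(1)
	}
	a, err := chunkHashes(os.Args[1])
	if err != nil {
		panic(err)
	}
	b, err := chunkHashes(os.Args[2])
	if err != nil {
		panic(err)
	}
	shared := 0
	for h := range b {
		if a[h] {
			shared++
		}
	}
	fmt.Printf("image B: %d chunks, %d already present in image A\n", len(b), shared)
}
```

One caveat I'm keeping in mind: fixed-size chunking only deduplicates data that stays block-aligned between builds, which I'm assuming is the case for these images.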
I expect to have a single peer in the cloud and 100+ peers initially in the lab, one per item of equipment. I hope to grow this to all our lab equipment, so perhaps 600-1000 peers.
Performance
So far my experiments suggest IPFS cannot transfer data between two peers faster than ~250 Mbps (about 30 MiB/s), which doesn't come close to saturating our 10+ Gbps network links. I have not yet tried a scale test with a large number of peers. Should I expect transfers to be quicker with more peers involved?
Is there anything I can tune to increase throughput? Or to reduce CPU load, if that is indeed my bottleneck. I benchmarked multihash performance and settled on blake3, as it's the fastest on the embedded systems we're using (which lack hardware acceleration for SHA-256).
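For anyone who wants to reproduce the hash comparison, a microbenchmark along these lines is enough (a minimal sketch; the lukechampine.com/blake3 package here is just one example implementation):

```go
package hashbench

import (
	"crypto/sha256"
	"testing"

	"lukechampine.com/blake3"
)

// One payload per iteration, sized to match the default IPFS chunk size.
var payload = make([]byte, 256<<10) // 256 KiB

func BenchmarkSHA256(b *testing.B) {
	b.SetBytes(int64(len(payload)))
	for i := 0; i < b.N; i++ {
		sha256.Sum256(payload)
	}
}

func BenchmarkBLAKE3(b *testing.B) {
	b.SetBytes(int64(len(payload)))
	for i := 0; i < b.N; i++ {
		blake3.Sum256(payload)
	}
}
```

`go test -bench=.` then reports per-hash throughput (the `SetBytes` call makes it print MB/s), which is how the gap shows up on hardware without SHA-256 acceleration.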
The swarm is entirely private, and I find myself wondering whether the default (256 KiB) and maximum (1 MiB) chunk sizes are too small for the network flow-control machinery (e.g. TCP window sizing) to reach full speed. I've tried 1 MiB chunk sizes and it didn't make a measurable difference.
I also wonder whether allowing more concurrent traffic might help, i.e. larger Bitswap wantlists.
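On that point: assuming a Kubo node, the knobs I have my eye on are the `Internal.Bitswap` worker and in-flight-bytes settings, something like the following (placeholder values I'd sweep, not recommendations):

```json
{
  "Internal": {
    "Bitswap": {
      "TaskWorkerCount": 16,
      "EngineTaskWorkerCount": 16,
      "EngineBlockstoreWorkerCount": 256,
      "MaxOutstandingBytesPerPeer": 16777216
    }
  }
}
```

As I understand the Kubo config docs, these can be set with `ipfs config --json` and need a daemon restart to take effect.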
Any tips or ideas for experiments to try would be appreciated!