We are storing large array-type data in the zarr format [1]. The zarr format is, in theory, well suited for distribution through IPFS because data is stored as chunks plus JSON metadata. There have even been efforts to map the format directly to IPLD [2].
For good performance, correct chunking is essential, and an empirically good chunk size is in the 1-5 MB range, which doesn't map neatly onto the 1 MB maximum block size supported by IPFS. I just ran a small experiment benchmarking a simple copy via rsync against ipfs get, and rsync was faster by a factor of ~5. I strongly suspect the problem was the additional overhead of the block storage (combined with the slow HDD of the sending machine).
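For concreteness, here is a minimal sketch of that kind of chunking with zarr-python (v2 API; the store path, array shape, and chunk shape are made-up illustration values):

```python
import numpy as np
import zarr

# Hypothetical example: a 2 GiB float64 array split into 2 MiB chunks,
# i.e. inside the empirically good 1-5 MB range.
z = zarr.open(
    "data.zarr",        # directory store: one file per chunk plus JSON metadata (.zarray)
    mode="w",
    shape=(16384, 16384),
    chunks=(512, 512),  # 512 * 512 * 8 bytes = 2 MiB per chunk
    dtype="f8",
)
z[:512, :512] = np.random.rand(512, 512)  # writes touch only the affected chunk files
```

Each chunk lands on disk as its own file, which is exactly the granularity you would want to map onto IPFS blocks, if the size limits lined up.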
Are there any efforts to increase the maximum block size supported by IPFS?
Iroh uses ~4 KiB block sizes, yet it's extremely fast too.
ipfs get is not very optimized and ends up waiting on round trips due to how bitswap works: it makes block-by-block requests, so sometimes no data flows because the server is waiting for the client to tell it which block it wants next. ipfs get doesn't make optimal use of bitswap either.
If you want it to run faster, run ipfs pin add Qmfoo in parallel with ipfs get Qmfoo. This is because ipfs pin add has a smarter algorithm that downloads more blocks in parallel, so it is able to keep the pipe fuller.
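A minimal sketch of that pattern (Qmfoo stands in for a real CID, and a running kubo daemon is assumed):

```python
import subprocess

cid = "Qmfoo"  # placeholder CID from the example above

# `ipfs pin add` fetches blocks with more parallelism, keeping the pipe full...
pin = subprocess.Popen(["ipfs", "pin", "add", cid])
# ...while `ipfs get` writes the files out, reusing blocks as they arrive locally.
get = subprocess.Popen(["ipfs", "get", cid])

pin.wait()
get.wait()
```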
Feather instead uses the trustless gateway protocol, so the client does not need to receive a block, decode it, and then ask the server for the next one. The client sends the complete query up front, and the server does the decoding and pushes blocks as fast as it can (until the pipe fills up and backpressure kicks in, or until some other bottleneck, like disk drive round trips, is reached).
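For illustration, a fetch over the trustless gateway looks roughly like this (a sketch, assuming a gateway reachable at 127.0.0.1:8080 that serves CAR responses; Qmfoo is again a placeholder CID):

```python
import requests

cid = "Qmfoo"                      # placeholder CID
gateway = "http://127.0.0.1:8080"  # assumed local gateway address

# The client states the whole query once; the server decodes the DAG and
# streams every block back as a CAR file, with no per-block round trips.
resp = requests.get(
    f"{gateway}/ipfs/{cid}",
    headers={"Accept": "application/vnd.ipld.car"},
    stream=True,
    timeout=60,
)
resp.raise_for_status()

with open(f"{cid}.car", "wb") as out:
    for chunk in resp.iter_content(chunk_size=1 << 20):
        out.write(chunk)
```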
Thanks for the replies, this was quite enlightening. I hope IPFS will support larger blocks at some point; this would be really important for scientific data!
kubo uses boxo as a library, right? So the faster UnixFS implementation will eventually make its way into kubo?