(Trying to answer my own question: I would assume various buffers are sized for small chunks, reads are sized to blocks instead of available memory, etc. – but that doesn’t actually answer it.)
Are you looking for more than the “why block limits” section of my original post, Supporting Large IPLD Blocks?
For example, if you’re wondering why, despite me writing this up 3 years ago (wow), it still hasn’t happened: the TL;DR is funding / prioritization. It’s not a small lift to move this from demoable to having specs and proper integration into most existing tooling (are you looking for more detail here?), and so far the people who make the funding calls haven’t found it high enough ROI.
I think this is still quite important for IPFS ecosystem growth, and I’m planning to fight for it to make it onto the priority list for 2026. If you do too, I can ping you on the 2026 public roadmapping docs as they start existing.
By the way, in case the 3-year duration of this thread gives you despair and you’re looking for some optimism: the integration of HTTP Trustless Gateway retrieval into the IPFS mainnet p2p layer earlier this year (vs. just as a way of verifying responses from gateways that fetch data from the network) is, IMO, useful help in decreasing the lift. Since the HTTP Trustless Gateway API is already set up to handle block-based and graph-based (CAR) retrievals, it gives us a surface for handling the in-between area of verifying large blocks (vs., say, reworking Bitswap or creating a new protocol). There’s still a bunch of work between where we are and supporting large blocks, but at least there’s some progress.
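To make that “surface for verification” concrete: per the Trustless Gateway spec, a client can fetch a single block with `GET /ipfs/<cid>?format=raw` (or `Accept: application/vnd.ipld.raw`) and then check the returned bytes against the CID locally. Here’s a minimal sketch of that local check, assuming a base32 CIDv1 with the raw codec and sha2-256 (a toy, not a full multiformats parser):

```python
import base64
import hashlib

def verify_raw_block(cid: str, block: bytes) -> bool:
    """Trustless verification of a raw block: recompute its hash and compare
    it to the digest embedded in the CID. Only handles CIDv1, base32, raw
    codec (0x55), sha2-256 (0x12) -- a sketch, not a multiformats library.
    """
    if not cid.startswith("b"):
        raise ValueError("expected a base32 CIDv1 (multibase prefix 'b')")
    raw = cid[1:].upper()
    raw += "=" * (-len(raw) % 8)  # restore the base32 padding stripped by CIDs
    data = base64.b32decode(raw)
    # Expected layout: version(0x01) codec(0x55) hashfn(0x12) length(0x20) digest
    if data[:4] != bytes([0x01, 0x55, 0x12, 0x20]):
        raise ValueError("unsupported CID layout for this sketch")
    return data[4:] == hashlib.sha256(block).digest()
```

Today the spec-side limit is what keeps this simple (a whole block fits in memory and is hashed at once); the open design work for large blocks is verifying incrementally instead of buffering the entire thing.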
Well, specifically I am reacting to how much debate there has been in this thread about picking a larger-but-still-conservative value – despair is indeed high, and solutions in that vein would do relatively little for my pain. looks sideways at IPNI
And yes, trustless gateways are indeed something that gives me some hope.
Let me share some experiment results, for anyone who wants to increase the block size for better deduplication.
It seems the block size doesn’t affect the deduplication ratio much.
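For context on how numbers like the ones below can be produced, here is a minimal sketch of the measurement (my reconstruction of the methodology, not the exact script used): split the input into fixed-size blocks, key each block by its SHA-256, store each distinct block once, and count every repeat as a reuse.

```python
import hashlib

def dedup_stats(data: bytes, block_size: int) -> dict:
    """Fixed-size chunking dedup measurement: identical blocks (by SHA-256)
    are stored once; every repeated block counts as a reuse."""
    seen = set()
    stored = 0  # bytes actually kept after dedup
    reused = 0  # number of blocks that matched an already-seen block
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        digest = hashlib.sha256(block).digest()
        if digest in seen:
            reused += 1
        else:
            seen.add(digest)
            stored += len(block)
    saved = len(data) - stored
    pct = 100 * saved / len(data) if data else 0.0
    return {"storage": stored, "reused": reused, "saved": saved, "saved_pct": pct}

# Example: 4 KiB of a repeating pattern dedupes down to a single 1 KiB block.
stats = dedup_stats(b"abcd" * 1024, block_size=1024)
# stats == {"storage": 1024, "reused": 3, "saved": 3072, "saved_pct": 75.0}
```

Note this is fixed-offset chunking, so the “# block reuse” column below falls as the block size grows: bigger blocks have fewer chances to be byte-identical.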
Text-based Data
data: GitHub - beenotung/tslib: utils library in TypeScript (including source, built JS files, and node_modules)
total_size: 624,806,286 bytes
Results
| block size* | storage size* | # block reuse | saved (bytes) | saved % |
|---|---|---|---|---|
| 1,024 | 155,365,444 | 483,353 | 469,440,842 | 75.13% |
| 2,048 | 155,835,717 | 256,208 | 468,970,569 | 75.06% |
| 4,096 | 156,015,582 | 146,666 | 468,790,704 | 75.03% |
| 8,192 | 156,167,190 | 93,234 | 468,639,096 | 75.01% |
| 16,384 | 156,281,878 | 67,585 | 468,524,408 | 74.99% |
| 32,768 | 156,331,030 | 55,339 | 468,475,256 | 74.98% |
| 65,536 | 156,429,334 | 49,467 | 468,376,952 | 74.96% |
| 131,072 | 156,757,014 | 46,727 | 468,049,272 | 74.91% |
| 262,144 | 157,019,158 | 45,461 | 467,787,128 | 74.87% |
| 1,048,576 | 157,805,590 | 44,584 | 467,000,696 | 74.74% |
| 4,194,304 | 160,951,318 | 44,420 | 463,854,968 | 74.24% |
| 10,485,760 | 158,854,166 | 44,386 | 465,952,120 | 74.58% |
Binary Data
data: backup app images of various versions of Cursor (an IDE forked from VSCode)
total_size: 2,822,251,957 bytes
| block size* | storage size* | # block reuse | saved (bytes) | saved % |
|---|---|---|---|---|
| 1,024 | 2,069,388,021 | 735,226 | 752,863,936 | 26.68% |
| 2,048 | 2,069,908,597 | 367,358 | 752,343,360 | 26.66% |
| 4,096 | 2,070,078,581 | 183,641 | 752,173,376 | 26.65% |
| 8,192 | 2,070,413,557 | 91,778 | 751,838,400 | 26.64% |
| 16,384 | 2,070,569,205 | 45,880 | 751,682,752 | 26.63% |
| 32,768 | 2,070,847,733 | 22,932 | 751,404,224 | 26.62% |
| 65,536 | 2,071,602,613 | 11,454 | 750,649,344 | 26.60% |
| 131,072 | 2,072,782,261 | 5,718 | 749,469,696 | 26.56% |
| 262,144 | 2,074,617,269 | 2,852 | 747,634,688 | 26.49% |
| 1,048,576 | 2,087,200,181 | 701 | 735,051,776 | 26.04% |
| 4,194,304 | 2,125,997,493 | 166 | 696,254,464 | 24.67% |
| 10,485,760 | 2,182,620,597 | 61 | 639,631,360 | 22.66% |
Remark: * block size and storage size are in bytes.
I know that the deduplication ratio is not the only factor to consider (e.g. compatibility with BitTorrent); just sharing the results for your reference.
Hi all, I’m a little late to this discussion, but may I suggest a (maybe temporary) solution:
- The IPFS daemon administrator may set the Bitswap block size limit. This limit may even be (as proposed in this thread) per IPLD type;
- The administrator may also allow users of the API to override it;
- IPLD/IPFS/IPNS URLs may accept a maximum block size: `ipfs://max5G@SomeSHA256OfAnUbuntuISOFromTheWebsite`;
This model allows each node to define its own “DDoSable” threshold. Nodes may also make exceptions for blocks / IPNS dirs / entire DAGs that are important to their users. Having a system like this creates the grounds for a consensus-based trust list of nodes. Once the concept of multiple block sizes is normalized and some problems are identified, more efficient management methods can be developed.
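To illustrate the proposed URL form, here is a toy parser for the `max<N><unit>@` prefix. Note that both the syntax and the `parse_max_block` helper are this thread’s invention, not part of any existing spec:

```python
import re

# Unit multipliers for the hypothetical "max<N><unit>@" prefix.
_UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3, "": 1}

def parse_max_block(url: str):
    """Return (max_block_size_in_bytes_or_None, cid) from an ipfs:// URL
    using the proposed syntax, e.g. ipfs://max5G@<cid>."""
    m = re.match(r"ipfs://(?:max(\d+)([KMG]?)@)?(.+)$", url)
    if not m:
        raise ValueError("not an ipfs:// URL")
    num, unit, cid = m.groups()
    limit = int(num) * _UNITS[unit] if num else None
    return limit, cid

limit, cid = parse_max_block("ipfs://max5G@SomeSHA256OfAnUbuntuISOFromTheWebsite")
# limit == 5 * 1024**3; a plain ipfs://<cid> URL yields limit == None,
# i.e. "fall back to the node's configured default".
```

One design question this surfaces: the limit lives in the URL, not in the CID, so two URLs naming the same content can disagree about it; the node’s own configured cap would still need to win.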
Possible shortcomings:
- Segregation of nodes based on maximum block limit;
- Nodes should hint their block size limit when talking to other nodes, so transactions can be rejected right away;
- Each node may have to keep track of the limits of every peer it’s connected to (and which blocks they hold) to avoid wasteful connections;
- Extra node overhead managing an “optimal” network and making large data accessible.
Thanks for sharing.
Regarding small blocks as a means of deduplication, I recently looked into this and shared why I think it’s not the right trade-off once you consider the cost of announcements, CID determinism, and traversing DAGs over the network. Here’s the post where I elaborate on this:
As for deduplication more broadly, I’d check out content-defined chunking (CDC) and the following paper: Analysis and Comparison of Deduplication Strategies in IPFS
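For a feel of what CDC does differently from fixed-offset chunking: boundaries are chosen from the content itself (a rolling hash hitting a condition), so inserting a few bytes early in a file only shifts the chunks near the edit instead of re-aligning every subsequent block. A toy chunker, illustrative only (real implementations such as FastCDC use gear hashes and normalized chunking):

```python
def cdc_chunks(data: bytes, mask: int = (1 << 12) - 1,
               min_size: int = 256, max_size: int = 8192) -> list:
    """Toy content-defined chunker: cut wherever a simple rolling hash of
    recent bytes satisfies (hash & mask) == 0, bounded by min/max sizes."""
    chunks, start, h = [], 0, 0
    for i in range(len(data)):
        h = ((h << 1) + data[i]) & 0xFFFFFFFF  # old bytes shift out after ~32 steps
        size = i - start + 1
        if (size >= min_size and (h & mask) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing remainder (may be < min_size)
    return chunks
```

With a 12-bit mask the average chunk lands around 4 KiB, so most chunks (and hence their CIDs) survive an edit unchanged – which is why CDC tends to beat plain fixed-size blocks for deduplicating versioned data like the backup-image tables above.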