Supporting Large IPLD Blocks

(trying to answer my own question, I would assume various buffers are sized for small chunks, reads are sized to blocks instead of available memory, etc – but that doesn’t actually answer it)

Are you looking for more than the “why block limits” section of my original post, Supporting Large IPLD Blocks?

For example, if you’re wondering why, despite me writing this up 3 years ago (wow), it still hasn’t happened: the TL;DR is funding / prioritization. It’s not a small lift to move this from demoable to having specs and proper integration into most existing tooling (are you looking for more detail here?), and thus far the people who make the funding calls haven’t found the ROI high enough.

I think this is still quite important for IPFS ecosystem growth, and I’m planning to fight for it to make it onto the priority list for 2026. If you are too, I can ping you on the docs related to 2026 public roadmapping as they start existing.


By the way, in case the 3-year duration of this thread gives you despair and you’re looking for some optimism: the integration of HTTP Trustless Gateway retrieval into the IPFS mainnet p2p layer earlier this year (vs. just as a way of verifying responses from gateways that fetch data from the network) is, IMO, useful in decreasing the lift here. Since the HTTP Trustless Gateway API is already set up to handle block- and graph-based (CAR) retrievals, it gives us a surface for handling the in-between area of verifying large blocks (vs., say, reworking Bitswap or creating a new protocol). There’s still a bunch of work required between where we are and supporting large blocks, but at least there’s some progress.
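To make that “surface for verification” concrete, here’s a minimal sketch of how a client can verify a single raw block fetched over the Trustless Gateway API. It assumes a CIDv1 with the raw codec and a sha2-256 multihash (the common case); the helper names are mine, not from any particular implementation:

```python
import base64
import hashlib

def decode_cidv1(cid: str) -> bytes:
    """Decode a base32 (multibase prefix 'b') CIDv1 string into raw bytes."""
    assert cid[0] == "b", "only base32 CIDv1 handled in this sketch"
    body = cid[1:].upper()
    body += "=" * (-len(body) % 8)  # restore the padding base32 requires
    return base64.b32decode(body)

def verify_block(cid: str, block: bytes) -> bool:
    """Check that a block's sha2-256 hash matches the digest inside the CID."""
    raw = decode_cidv1(cid)
    # raw = <version varint><codec varint><multihash>;
    # for sha2-256 the multihash tail is 0x12 0x20 + 32-byte digest
    assert raw[-34:-32] == b"\x12\x20", "only sha2-256 handled in this sketch"
    return hashlib.sha256(block).digest() == raw[-32:]

# Usage: fetch the raw block from any trustless gateway, then verify locally:
#   GET https://<gateway>/ipfs/<cid>?format=raw
#   (or the Accept: application/vnd.ipld.raw header)
```

The point is that the client never has to trust the gateway: a mismatched hash means the block is discarded, which is the same property a large-block retrieval path would need.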

Well, specifically I am reacting to how much debate there has been in this thread about picking a larger-but-still-conservative value – despair is indeed high, and solutions in that vein would do relatively little for my pain. *looks sideways at IPNI*

And yes, trustless gateways are indeed something that gives me some hope.

Let me share some experiment results, in case you want to increase block size for better deduplication.

It seems the block size doesn’t affect the deduplication ratio much.

Text-based Data

data: GitHub - beenotung/tslib: utils library in Typescript (including both source, built js file, and node_modules)

total_size: 624,806,286 bytes

Results

| block size* | storage size* | # block reuse | saved (bytes) | saved % |
|------------:|--------------:|--------------:|--------------:|--------:|
| 1,024 | 155,365,444 | 483,353 | 469,440,842 | 75.13% |
| 2,048 | 155,835,717 | 256,208 | 468,970,569 | 75.06% |
| 4,096 | 156,015,582 | 146,666 | 468,790,704 | 75.03% |
| 8,192 | 156,167,190 | 93,234 | 468,639,096 | 75.01% |
| 16,384 | 156,281,878 | 67,585 | 468,524,408 | 74.99% |
| 32,768 | 156,331,030 | 55,339 | 468,475,256 | 74.98% |
| 65,536 | 156,429,334 | 49,467 | 468,376,952 | 74.96% |
| 131,072 | 156,757,014 | 46,727 | 468,049,272 | 74.91% |
| 262,144 | 157,019,158 | 45,461 | 467,787,128 | 74.87% |
| 1,048,576 | 157,805,590 | 44,584 | 467,000,696 | 74.74% |
| 4,194,304 | 160,951,318 | 44,420 | 463,854,968 | 74.24% |
| 10,485,760 | 158,854,166 | 44,386 | 465,952,120 | 74.58% |

Binary Data

data: backup app images of various versions of Cursor (an IDE forked from VS Code)

total_size: 2,822,251,957 bytes

| block size* | storage size* | # block reuse | saved (bytes) | saved % |
|------------:|--------------:|--------------:|--------------:|--------:|
| 1,024 | 2,069,388,021 | 735,226 | 752,863,936 | 26.68% |
| 2,048 | 2,069,908,597 | 367,358 | 752,343,360 | 26.66% |
| 4,096 | 2,070,078,581 | 183,641 | 752,173,376 | 26.65% |
| 8,192 | 2,070,413,557 | 91,778 | 751,838,400 | 26.64% |
| 16,384 | 2,070,569,205 | 45,880 | 751,682,752 | 26.63% |
| 32,768 | 2,070,847,733 | 22,932 | 751,404,224 | 26.62% |
| 65,536 | 2,071,602,613 | 11,454 | 750,649,344 | 26.60% |
| 131,072 | 2,072,782,261 | 5,718 | 749,469,696 | 26.56% |
| 262,144 | 2,074,617,269 | 2,852 | 747,634,688 | 26.49% |
| 1,048,576 | 2,087,200,181 | 701 | 735,051,776 | 26.04% |
| 4,194,304 | 2,125,997,493 | 166 | 696,254,464 | 24.67% |
| 10,485,760 | 2,182,620,597 | 61 | 639,631,360 | 22.66% |

Remark: * the unit for storage size and saved size is bytes.

I know that deduplication ratio is not the only factor to consider (e.g. compatibility with BitTorrent). Just sharing the results for your reference.
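For anyone who wants to reproduce numbers like the tables above, here’s a minimal sketch of a fixed-size-block measurement (my own helper, not the exact script used for the experiment; I count each duplicate occurrence of a block as one “reuse”):

```python
import hashlib

def dedup_stats(data: bytes, block_size: int):
    """Split data into fixed-size blocks and measure deduplicated storage."""
    sizes = {}   # block hash -> block length (stored once per unique block)
    counts = {}  # block hash -> number of occurrences
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        h = hashlib.sha256(block).digest()
        sizes[h] = len(block)
        counts[h] = counts.get(h, 0) + 1
    storage = sum(sizes.values())            # bytes kept after dedup
    reuse = sum(c - 1 for c in counts.values())  # duplicate occurrences
    saved = len(data) - storage
    return storage, reuse, saved
```

In a real run you would stream files from disk instead of holding `data` in memory, but the accounting is the same.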


Hi all, I’m a little late to this discussion, but may I suggest a (maybe temporary) solution:

  • The IPFS daemon administrator may set the Bitswap block size limit. This could even be defined (as proposed in this thread) per IPLD type;
  • The IPFS daemon administrator may also allow this to be overridden by users of the API;
  • IPLD/IPFS/IPNS URL may accept a maximum block size: `ipfs://max5G@SomeSHA256OfAnUbuntuISOFromTheWebsite`;

This model allows each node to define its own “DDoSable” threshold. Nodes may also make exceptions for blocks / IPNS dirs / entire DAGs that are important to their users. A system like this creates the grounds for a consensus-based trust list of nodes. Once the concept of multiple block sizes is normalized and some problems are identified, more efficient management methods can be developed.
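As a rough illustration, the proposed `max…@` URL prefix could be parsed like this (a sketch only: the exact grammar, the unit letters, and the function name are my assumptions, not a spec):

```python
import re

# Hypothetical size suffixes for the proposed ipfs://max<N><unit>@<cid> syntax
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}

def parse_max_prefix(url: str):
    """Split an ipfs:// URL into (max_block_size_bytes or None, cid)."""
    m = re.match(r"ipfs://(?:max(\d+)([KMG])@)?(.+)$", url)
    if not m:
        raise ValueError("not an ipfs:// URL")
    num, unit, cid = m.groups()
    limit = int(num) * UNITS[unit] if num else None
    return limit, cid
```

A node could compare the parsed limit against its administrator-configured threshold before deciding whether to attempt the fetch at all.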

Possible shortcomings:

  • Segregation of nodes based on maximum block limit;
  • Nodes should hint their block size limit when talking to other nodes, so transactions can be rejected right away;
  • Each node may have to keep track of the limits of every peer it’s connected to (and their blocks) to avoid wasteful connections;
  • Extra node overhead managing an “optimal” network and making large data accessible.

Thanks for sharing.

Regarding small blocks as a means of deduplication, I recently looked into this and explained why I don’t think it’s the right trade-off once you consider the cost of announcements, CID determinism, and traversing DAGs over the network. Here’s the post where I elaborate on this:

As for deduplication more broadly, I’d check out content-defined chunking (CDC) and the following paper: Analysis and Comparison of Deduplication Strategies in IPFS
