Supporting Large IPLD Blocks

IPLD Blocks

An IPLD block is a sequence of bytes that can be referenced by a CID.

Block Limits

Current IPFS implementations recommend that users create blocks of at most 1 MiB, and are expected to accept and transfer blocks of up to 2 MiB.

There is no hard block limit in IPLD, nor one specified across all IPFS implementations, although the above are generally good guidelines for the ecosystem.

Why it’s sad that we have block limits :cry:

We lose backwards compatibility with existing hash-linked data structures whose block limit was either chosen to be larger than ours (2MiB), or where no limit was chosen at all.

People have been using hashes as checksums for Git commits, ISO downloads, torrents, antivirus checks, package managers, etc. for a while now and many of those hashes are of blocks of data larger than a couple of MiBs. This means you can’t reasonably do ipfs://SomeSHA256OfDockerContainerManifest and have it just work. Similarly you cannot do ipfs://SomeSHA256OfAnUbuntuISOFromTheWebsite and have that work.

Ultimately, the block limit restricts the set of hash-linked data structures representable by IPLD, leaving many existing structures out. On the IPLD website we have the line

IPLD is the data model of the content-addressable web. It allows us to treat all hash-linked data structures as subsets of a unified information space, unifying all data models that link data with hashes as instances of IPLD.

Which today looks like

IPLD is the data model of the content-addressable web. It allows us to treat all hash-linked data structures, with blocks at most 2MiB, as subsets of a unified information space, unifying ~~all~~ some data models that link data with hashes as instances of IPLD.

Why block limits?

A major reason why block limits exist is to enable incremental verifiability of data. For example, if I was given the SHA2-256 of a 100GB file it would be a shame to download the whole 100GB just to find out it was the wrong file. This kind of attack effectively makes incremental verifiability required in order to enable peer-to-peer (as opposed to a more trusted friend-to-friend) transfer of content-addressed data.
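
To make this concrete, here is a toy sketch (chunk size and helper names are made up for illustration, not any implementation's API) contrasting a single flat hash, which can only be checked after the last byte arrives, with per-chunk hashes that let a bad piece be rejected immediately:

```python
import hashlib

CHUNK = 256 * 1024  # illustrative chunk size, roughly a typical UnixFS chunk

def flat_digest(data: bytes) -> bytes:
    # One SHA-256 over everything: nothing is checkable until the final byte.
    return hashlib.sha256(data).digest()

def chunk_digests(data: bytes) -> list:
    # Per-chunk SHA-256s (the idea behind UnixFS-style DAGs): each piece
    # becomes verifiable the moment it arrives.
    return [hashlib.sha256(data[i:i + CHUNK]).digest()
            for i in range(0, len(data), CHUNK)]

def verify_stream(chunks, expected):
    # Reject a bad chunk immediately instead of after the full download.
    for i, chunk in enumerate(chunks):
        if hashlib.sha256(chunk).digest() != expected[i]:
            return i  # index of the first bad chunk
    return None  # everything verified
```

With only `flat_digest`, a peer serving a bogus 100GB block costs you the entire download before the mismatch is detectable; with `chunk_digests`, the first bad chunk is caught after at most one chunk's worth of wasted bandwidth.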

From what I can tell this has historically been the reason people have argued that we should have both implementation and ecosystem-wide block limits. However, more recently there have been pushes and proposals in the IPFS ecosystem that enable incrementally-verified transfer of large blocks in many scenarios.

This has resulted in a newer argument in favor of block limits: having some fixed limit makes building IPLD tooling easier, since it allows assumptions and technical simplifications that otherwise could not be made.

Underlying both the security and tooling arguments is the implicit argument that it is helpful for users and our ecosystem that data be as transportable as possible, and that it would be a shame if one set of IPFS applications chose a 2MiB limit while another chose 10MiB, since the 10MiB data would not be compatible with the 2MiB-limited applications.

Solving Block Limits - Data Transfer Security

As described more in depth in the presentation on this (slides), as well as in an early proposal, for a number of common use cases we can deal with the security issues at the data transfer layer by leveraging some of the properties of common hash constructions.

Merkle-Damgård Constructions

These include common hash functions like SHA-1/2/3. In short, if you are able to assume freestart-collision resistance rather than only collision resistance (which appears safe for SHA-2/3, and SHA-1's collision resistance is in any event problematic), then you can download even a 100GB block in an incrementally verified way if you download it backwards.

The start of this downloading backwards process looks like:
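
A toy sketch of the process (using a stand-in compression function rather than SHA-256's real internal one, and ignoring padding and length encoding; all names are illustrative): the sender supplies each chunk together with the claimed internal state preceding it, and the receiver verifies from the end of the block toward the start. Soundness rests on freestart-collision resistance, since the attacker gets to pick the claimed intermediate states.

```python
import hashlib

IV = b"\x00" * 32   # toy initialization vector
CHUNK = 64          # toy chunk size

def compress(state: bytes, chunk: bytes) -> bytes:
    # Stand-in for a Merkle-Damgard compression function.
    return hashlib.sha256(state + chunk).digest()

def md_hash(data: bytes) -> bytes:
    # Forward hashing: chain the compression function over the chunks.
    state = IV
    for i in range(0, len(data), CHUNK):
        state = compress(state, data[i:i + CHUNK])
    return state

def verify_backwards(target: bytes, pairs) -> bool:
    # pairs: (claimed_previous_state, chunk), last chunk first. Each step
    # checks that the pair hashes into a state we already trust, then
    # trusts the claimed previous state and continues toward the start.
    trusted = target
    for prev_state, chunk in pairs:
        if compress(prev_state, chunk) != trusted:
            return False  # bad chunk detected immediately
        trusted = prev_state
    return trusted == IV  # reached the known IV: the whole block verified
```

A sender walking the file forward once can record the intermediate states and serve them alongside the chunks in reverse order.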

Merkle Tree Constructions

These include some newer hash functions like Blake3. In short, just as a UnixFS file is incrementally verifiable by virtue of being a merkle tree, you can construct a hash function that is itself a merkle tree and is therefore incrementally verifiable.
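
A toy version of the idea (not actual Blake3, which has its own chunking, flags, and tree shape; all names here are illustrative): hash chunks as leaves, combine pairs into parent nodes, and verify any single chunk against the root with just its merkle path.

```python
import hashlib

CHUNK = 1024  # toy chunk size

def leaf(chunk: bytes) -> bytes:
    return hashlib.sha256(b"leaf" + chunk).digest()

def node(left: bytes, right: bytes) -> bytes:
    return hashlib.sha256(b"node" + left + right).digest()

def levels(data: bytes) -> list:
    # All levels of the tree, leaves first, root level last.
    lvl = [leaf(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]
    out = [lvl]
    while len(lvl) > 1:
        lvl = [node(lvl[i], lvl[i + 1]) if i + 1 < len(lvl) else lvl[i]
               for i in range(0, len(lvl), 2)]
        out.append(lvl)
    return out

def tree_hash(data: bytes) -> bytes:
    return levels(data)[-1][0]

def prove(data: bytes, index: int) -> list:
    # Sibling hashes from leaf to root (None where a level has no sibling).
    proof = []
    for lvl in levels(data)[:-1]:
        sibling = index ^ 1
        proof.append(lvl[sibling] if sibling < len(lvl) else None)
        index //= 2
    return proof

def verify_chunk(chunk: bytes, index: int, proof: list, root: bytes) -> bool:
    # Recompute the path from this one chunk up to the root.
    h = leaf(chunk)
    for sibling in proof:
        if sibling is not None:
            h = node(h, sibling) if index % 2 == 0 else node(sibling, h)
        index //= 2
    return h == root
```

The proof for one chunk is O(log n) hashes, which is what makes seeking into the middle of a merkle-tree-hashed blob verifiable without downloading the rest.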

Solving Block Limits - IPLD Tooling

The argument that IPLD is even a reason for us to have block limits is new to me. The first time I heard this argument was at IPFS Thing 2022. Overall the idea is that it makes creating IPLD tooling more complicated and increases the probability that some data will work nicely in some IPFS implementations but not in others.

Historically, making block limits part of IPLD has been rejected (e.g. in Max node size limitations · Issue #48 · ipld/ipld · GitHub and the maximum block size should be specified · Issue #193 · ipld/specs · GitHub). This argument therefore implies either that the IPLD specs should change, or that there should be an IPFS-wide block limit motivated by IPLD tooling while IPLD itself still has no limit, which seems strange.

Below are the highlights of the impacted IPLD components:

Codecs

Supporting IPLD Codecs on blocks >2MiB will have some pain, but it’s nothing new:

  • Painful: Instead of being able to work with the serialized data all at once, it has to be handled in pieces, which might not be possible in certain environments for certain formats. This could lead to some data being readable in some contexts but not in others.
    • For example, building a streaming JSON decoder that can get useful information out of a single 100GB object might be painful, so my DAG-JSON implementation might decide to only support handling blocks up to 2MiB here, so as to be both simpler to implement and not run out of memory during processing. This means some IPFS implementations would be able to process a given piece of DAG-JSON data while others would not.

This problem of some data being readable in some contexts but not in others is not new:

  • Not all codecs (or even hash functions) are implemented in every IPFS implementation
  • For any codec where you would be concerned about decoding a 100GB block, there could be an ADL that does the same across a graph of a million 1MB blocks and would fit the current model

It’s also the kind of thing we automatically have to deal with in a world of remote, dynamically loaded codecs and ADLs, as proposed in some of the WASM + IPFS integration proposals.

  • Any sort of dynamic loading involves running code from an untrusted source in a sandbox with some kind of resource limiting. If the resource limiting is not a globally agreed upon number then we end up with the same issues as having per-peer block limits, which is not particularly different from no block limits at all
    • While we could agree on global resource limits here, thus far none of the dynamic loading proponents I have spoken with are in support of having global resource limits

This is also an area where we can move slowly and only expose pieces as we need them. Most large blocks tend to just be piles of bytes rather than more structured data. This means that if there were concerns here we could limit the scope and decide either that:

  1. The only IPLD codec that can apply to a large block is the raw codec which refers to plain bytes.
  2. The only IPLD codecs that can apply to a large block are ones whose decoding results in streamable bytes. This is similar to the above, but also allows for codecs that are simple transformations across bytes.

Data Storage

To support large blocks, it is likely that block storage systems will need to differentiate between large and small data, especially if there are any transport-specific optimizations they’ll want to pre-compute locally.

For example, whether for a Blake3 block or a large SHA-256 block, the block storage may want some level of indirection that separates the mapping multihash → collection of chunks to load to reconstruct the data from the mapping multihash → information needed to efficiently make the data available to the transport layer.

This is somewhat unfortunate, however:

  1. Many implementations already have these types of indirection layers in their key-value stores
    1. Boost and lotus have multihash → set of CAR files + offsets for the block
    2. Kubo’s filestore/urlstore has multihash → the file or URL, plus offset and range, where the block might live
  2. It seems very likely the transport level optimizations will start to appear in data storage anyhow
    1. Git has pack-files for their trusted transfer protocol
    2. Synchronization protocols like dsync and ones based on invertible bloom filters will likely track the collections of blocks they are synchronizing
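
The indirection described above can be sketched with a toy blockstore (in-memory dicts standing in for real key-value stores; names and the chunk size are made up for illustration): a manifest maps the block's own hash to the ordered chunk hashes needed to reconstruct it.

```python
import hashlib

class ChunkedBlockstore:
    """Toy blockstore with one level of indirection for large blocks."""

    CHUNK = 1024 * 1024  # 1 MiB chunks, an illustrative choice

    def __init__(self):
        self.chunks = {}    # chunk sha256 -> chunk bytes
        self.manifest = {}  # block sha256 -> ordered list of chunk sha256s

    def put(self, data: bytes) -> bytes:
        block_hash = hashlib.sha256(data).digest()
        refs = []
        for i in range(0, len(data), self.CHUNK):
            chunk = data[i:i + self.CHUNK]
            chunk_hash = hashlib.sha256(chunk).digest()
            self.chunks[chunk_hash] = chunk  # identical chunks deduplicate
            refs.append(chunk_hash)
        self.manifest[block_hash] = refs
        return block_hash

    def get(self, block_hash: bytes) -> bytes:
        # Reassemble the block; a real store might also keep per-transport
        # data here (e.g. precomputed tree nodes or CAR offsets).
        return b"".join(self.chunks[c] for c in self.manifest[block_hash])
```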

Data Transfer

More so than with storage, the cost-to-get-started of building a new generic IPFS data transfer protocol increases as the block limit increases.

While not every IPFS implementation even supports the same set of hash functions, the added complexity to support large blocks of data increases what it takes to build a new highly compatible IPFS implementation.

Groups that have so far chosen to rely on only a single data transfer protocol will have to start adding support for new data transfer implementations if they want to support large blocks.

Alternatives - Just leave it alone

The alternative to removing the block limit generally ends up being to just leave it at 2MiB. Sure, we could increase it to 4, 10, 1000, etc., but either way you run into the same sorts of tradeoffs already mentioned around security, ease of implementation, resource consumption, helping users make the “right decisions™”, and so on. So while we have to choose an arbitrary number, sticking with the one we already have seems reasonable.

By leaving things as they are we continue to be in a world where most tooling in the IPFS ecosystem is unable to deal with a lot of the content addressable data that is already out there in places like:

  • Programming language package managers
  • Operating system package managers
  • Docker container registries
  • Blockchain storage systems like Arweave and Sia that chose larger block limits
  • Git
  • BitTorrent

While IPFS tooling can work with these formats when individual blocks are small, and that’s sufficient for some use cases, there’s a whole lot of large-block data out there, and sometimes the small percentage of something like Git repos or BitTorrent files that are not compatible is enough to hurt the compatibility story more generally.

The major alternatives then for interoperability frequently look something like:

  1. Have a trusted map of SHA256 of a 1GB Ubuntu ISO → UnixFS representation of that data and use that trusted map for lookups
    1. This requires a trusted map which is generally not what we want to do with our self-certifiable content-addressed data
  2. Convince people distributing graphs with large blocks that they should also/instead distribute graphs with small blocks
    1. While 1MB blocks seem generally better than 100MB blocks, convincing people to change their patterns is more difficult. Why not let them see some of the benefits of interoperability with the rest of the IPFS ecosystem first, and then let them change things to get additional benefits, rather than forcing them to do all or nothing, or to use something like a trusted mapping?

Removing Block Limits - Let’s do it!

We are now at a point with block limits where the question is no longer whether we can do it, but whether we should. I think the answer here is yes! Unlocking interoperability with other content-addressed systems, where we safely can, seems like a big step forward that’s worth the additional cognitive overhead and technical complexity of needing to handle it.

While I don’t think building new systems with large blocks as the smallest units of data is a particularly good idea, due to issues like the complexity of building efficient streaming parsers and data transfer protocols, I’d like to see us build out compatibility even with those who disagree, while also providing documentation and examples to help them understand why and how to make better data structures in the future.

For example, while there is now a proposal for a way to safely download a 1GB SHA256 block, it’s poorly suited for things like video, because the data must be downloaded backwards and there is no way to seek into the middle and start without reading everything from the end up until that point. Helping people understand the tradeoffs and alternatives so they can make their own choices seems better than dictating our own.

The underlying premise of keeping block limits around in the ecosystem is that we would otherwise need to build more sophisticated tooling to deal with these large blocks, which would increase the complexity of new IPFS implementations that wish to be compatible with existing ones. While this type of compatibility and ease of implementation is admirable, I think restricting what users can do in order to make them do the “right” thing isn’t really the “style” of the IPFS ecosystem.

We like to be the open tent of content addressing that says it’s fine to bring whatever data transfer protocol you want, whatever content routing system you want, or whatever content addressable data structures you want and still be part of the IPFS ecosystem. As it is today, not every IPFS implementation has tooling to work with all types of data equally.

If I had to choose a dividing line in the sand here around what types of data should be “in bounds” for IPFS, it would be difficult. I think it would be that it should be possible for other implementations, running on a variety of platforms and with a variety of threat models, to all be able to get and process that data. However, just because they can get and process the data doesn’t mean they have to.



Yes… except 10 MiB would still be better than 2 MiB for many things. So it’s a one-liner improvement that could happen right away. Low-hanging fruit. A final solution? No, but it should probably be done anyway.


Wouldn’t checking the advertised sha256 before allowing it to be pinned somewhat protect against a DoS attack? Sure, I had to waste a bunch of bandwidth downloading a 1G file that was bogus, but I’m not going to reprovide that content. Then the attacker is the only one providing the bogus content, opening themselves to a bigger DDoS attack. If I download, verify, and pin the content, that would be like a vote that it’s legit. Sure, I could set up a large number of bogus providers, but I could also set up a large number of bogus providers that poison a large number of small blocks too. It seems like the only protection is that legit providers drown out bogus providers. As far as I know there isn’t any filtering of nodes that provide bad blocks.

It would be like punching yourself in the face because it made your adversary uncomfortable.

For an npmjs → IPFS gateway, a larger block size would allow migrating more content: the package needs to fit into one block, since the IPFS CID is computed from the npmjs sha256 hash. For Filecoin integration, larger blocks will speed up transfers. Torrents can have blocks from 64 kB to 16 MB.
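
For what it's worth, wrapping an existing sha256 digest into a CID is mechanical; here is a sketch for a CIDv1 with the raw codec (assuming the registry publishes the plain SHA-256 of the tarball, which is the scenario described above):

```python
import base64

def cid_v1_from_sha256(digest: bytes) -> str:
    # CIDv1 = <version><codec><multihash>; multihash = <code><len><digest>.
    # All of these values fit in single-byte varints here.
    assert len(digest) == 32
    multihash = bytes([0x12, 0x20]) + digest      # sha2-256 code, 32-byte length
    cid_bytes = bytes([0x01, 0x55]) + multihash   # CIDv1, raw codec
    # Multibase base32: "b" prefix, lowercase, no padding.
    return "b" + base64.b32encode(cid_bytes).decode().lower().rstrip("=")
```

Such CIDs share the familiar `bafkrei…` prefix; the catch discussed in this thread is that the bytes behind them are only fetchable over IPFS if implementations accept a block as large as the whole tarball.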

Probably 4 MB blocks will be ok.

In the three comments so far I’ve seen proposals for 10MiB, 4MiB, and unlimited non-incrementally verified blocks.

From what I can tell, none of these have any particular justification for why the number was chosen, or what use cases it unlocks that were not there before. Increasing the block limit effectively hoists that requirement onto everyone, which in a non-consensus distributed ecosystem isn’t something you can just do every year; it should take into account the variety of IPFS use cases and platforms (in IoT, satellites, laptops, phones, server farms, etc.) without making data untransferable across platforms.

That being said, if there were some compelling use case and stats to bump the magic number up to a different magic number, that seems like something that could be addressed and pushed for; I just haven’t seen any so far.

For example, what’s the magic that 4MiB or 10MiB brings to the ecosystem that makes it worthwhile to bump the number to there but not further?


Not really. The DoS attack can occur at multiple levels: for example, wasting bandwidth, and storing the partial block in “temporary space” (whether that’s memory, disk, etc.) while awaiting authentication.

Your claim that resource usage is symmetrical and that DDoS attacks are essentially inevitable even with a block limit of 100B since the adversary can advertise millions of peers that have that block is both true and not.

  • Yes, the adversary can get you to waste bandwidth and connections if they can put tons of bogus records in the content routing system.
    • However, clients can choose which peers to fetch data from and how much to trust the content routing system. That kubo doesn’t have anything fancy here on the client side doesn’t mean we should be hamstrung at the protocol layer.
    • For example, there have been recent proposals and interest for how to extend go-bitswap to allow for better selection and filtering of peers to use on the client side that could do all sorts of things (e.g. grow the amount of data they’re willing to download from a peer based on past behavior, penalize peers that send data we haven’t recently asked [them] for, limit data coming from IP ranges that haven’t proven historically useful …). See Add peer scoring / reputation tracking · Issue #577 · ipfs/go-bitswap · GitHub for a start on the issue.
  • No, bandwidth attacks aren’t the only problem. If you have to download a 100GB file and get all of it before it is verified, you either have to store that 100GB file in RAM (which could OOM and kill the process) or store it all on disk
    • If you wanted to go the temp file route rather than allowing OOMing there’s a bunch of complexity to worry about on the implementation layer. For example, using streaming protobuf decoding in Bitswap implementations, figuring out what to do if the scratch space for temp files is too small to complete either of two competing large blocks, etc.
    • There are also other problems that come from non-incrementally verifiable large blocks (e.g. difficulty resuming the download of an interrupted large block, problems downloading from peers in parallel, etc.)
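
A minimal sketch of the escalating-trust idea mentioned above (purely illustrative; this is not go-bitswap's actual scoring, and the names and numbers are made up): start each peer with a small allowance, grow it as they deliver verified data, and shrink it when they send bad blocks.

```python
class PeerBudget:
    """Toy per-peer download budget for a client-side fetch policy."""

    START = 1 << 20  # 1 MiB starting allowance (arbitrary, for illustration)

    def __init__(self):
        self.budget = {}

    def allowance(self, peer: str) -> int:
        # How many bytes we are currently willing to fetch from this peer.
        return self.budget.setdefault(peer, self.START)

    def on_verified(self, peer: str, nbytes: int) -> None:
        # Reward peers whose data verified: let them serve us more next time.
        self.budget[peer] = self.allowance(peer) + nbytes

    def on_bad_block(self, peer: str) -> None:
        # Penalize peers that wasted our bandwidth by halving their allowance.
        self.budget[peer] = max(1, self.allowance(peer) // 2)
```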

Perhaps I’ve missed something and all these issues are easier to resolve than they appear. If so, propose some changes for how they’d be handled. If handling infinitely sized, non-incrementally-verifiable blocks in an untrusted p2p network turns out to be easy, then it seems like something IPFS implementations should support.


Agreed, although part of me wonders about the value brought to ecosystems like npm as a function of the percentage of data that can be transferred over IPFS. IIRC 2MiB covers a lot of packages already and 4MiB wouldn’t cover them all. Does that incremental move to 4 provide a lot of value or do we need a way to cover all of them to really get the added value beyond 2MiB blocks?

I don’t think larger blocks necessarily implies faster speedups. Look at BitTorrent. As you mentioned BitTorrent v1 has 64KiB to 16MiB blocks, however BitTorrent v2 has 16KiB blocks in the merkle-tree. If you want speed increases here there’s plenty to do at the data transfer protocol and implementation layers such that the block limit seems like a distraction.


I mean, if you make a new default, you get something like 20x fewer DHT announcements per MB, a 20x reduction of keys in datastores, less bitswap contention, better utilization of the streams, a smaller memory footprint when indexes are memory-mapped, etc. It sounds like it does bring lots of tangible improvements, so it’s more a question of what the downsides are and at which number they become unacceptable.

Now, there is no magic number, but we should not hold up a one-liner improvement because it isn’t a total solution. eMule used 9MB piece sizes back in the modem dial-up days. We can always take the response size averages from the IPFS gateways, see what the usual size of content is, and check whether 4MiB would be better than 10MiB, etc.

Perhaps we should put this discussion in its own thread, as it is derailing the real topic (my bad :frowning: ).

I think the point of this discussion is that we cryptographically can do whatever size we want and still verify incrementally.

We don’t need a limit if we implement @adin’s proposal: you compute from the end, the IV is seeded by the remote node, and you verify that after X bytes of data you recover the expected state (either the hash itself for the first piece, or the IV of the previous download). (Actually, other pieces of the code like the IPLD stack probably still need a limit, so huge blocks will probably be raw-only for a while.)

“X is bigger and better: less per-block overhead, and protocol Y already uses big blocks, so it’s fine” is an argument that never ends.

Sounds like a solution right there, and it would move it from peer-to-peer to friend-to-friend like you mentioned previously.

You’ve already committed to downloading that 100GB file, so you’re presumably going to need to be able to handle a file of that size. If it’s bogus it’s wasted effort, but your intention is that it should succeed.

Seems like it all boils down to a trust issue.

Larger blocks lead to speedups: there is a considerable (about 3x) speed increase (measuring wall clock) when using 1M blocks instead of the default 256 kB.

The speed increase will taper off. It will be something like:

1M blocks: 3 times faster than the default
2M blocks: 3.5 times faster
4M blocks: 3.8 times faster

This needs to be measured in different network conditions before deciding on block size.

I’ve been playing around with an idea that I haven’t completely thought out but I thought I’d share to get people’s feedback.

Is there a possibility that IPFS could support dynamic block sizes with the use of a composable hash like LTHash? With a composable hash, hash(a) + hash(b) = hash(ab). So say a and b are 1M blocks: I could request a larger 2M block, hash(ab), and verify before requesting it that the larger block is the concatenation of the two smaller blocks. You could start at some higher level, say 4M blocks, get what you can, and then move progressively down to fill in what you need with smaller blocks.
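
To illustrate the property being described, here is a toy additive hash (far weaker than real LTHash, which uses lattice-style vector addition; everything here is made up for illustration): summing per-chunk SHA-256s makes hash(a) + hash(b) = hash(ab) whenever a ends on a chunk boundary. Note that this toy version is order-insensitive, so reordering chunks collides; a real construction would have to bind positions without breaking composability.

```python
import hashlib

CHUNK = 1024       # toy chunk size
MOD = 1 << 256     # keep sums in a fixed-width group

def additive_hash(data: bytes) -> int:
    # Sum of per-chunk SHA-256s mod 2**256. Composable under
    # concatenation when chunk boundaries line up.
    total = 0
    for i in range(0, len(data), CHUNK):
        digest = hashlib.sha256(data[i:i + CHUNK]).digest()
        total = (total + int.from_bytes(digest, "big")) % MOD
    return total
```

With chunk-aligned a and b, a client could check that two advertised small blocks really do compose into the larger one before deciding which granularity to fetch.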