Supporting Large IPLD Blocks

IPLD Blocks

An IPLD block is a sequence of bytes that can be referenced by a CID.

Block Limits

Current IPFS implementations recommend that users create blocks ≤ 1 MiB, and that implementations be able to accept and transfer blocks ≤ 2 MiB.

There is no hard block limit in IPLD, nor one specified across all IPFS implementations, although the above are generally good guidelines for the ecosystem.

Why it’s sad that we have block limits :cry:

Backwards compatibility with existing hash-linked data structures where the block limit was either chosen to be larger than ours (2MiB), or where no limit was chosen at all.

People have been using hashes as checksums for Git commits, ISO downloads, torrents, antivirus checks, package managers, etc. for a while now and many of those hashes are of blocks of data larger than a couple of MiBs. This means you can’t reasonably do ipfs://SomeSHA256OfDockerContainerManifest and have it just work. Similarly you cannot do ipfs://SomeSHA256OfAnUbuntuISOFromTheWebsite and have that work.

Ultimately, the block limit restricts the set of hash-linked data structures representable by IPLD, leaving many existing structures out. On the IPLD website we have the line:

IPLD is the data model of the content-addressable web. It allows us to treat all hash-linked data structures as subsets of a unified information space, unifying all data models that link data with hashes as instances of IPLD.

Which today looks like

IPLD is the data model of the content-addressable web. It allows us to treat all hash-linked data structures, **with blocks at most 2MiB**, as subsets of a unified information space, unifying ~~all~~ **some** data models that link data with hashes as instances of IPLD.

Why block limits?

A major reason why block limits exist is to enable incremental verifiability of data. For example, if I was given the SHA2-256 of a 100GB file it would be a shame to download the whole 100GB just to find out it was the wrong file. This kind of attack effectively makes incremental verifiability required in order to enable peer-to-peer (as opposed to a more trusted friend-to-friend) transfer of content-addressed data.

From what I can tell this has historically been the reason people have argued that we should have both implementation and ecosystem-wide block limits. However, more recently there have been pushes and proposals in the IPFS ecosystem that enable incrementally-verified transfer of large blocks in many scenarios.

This has resulted in some new arguments in favor of block limits: namely, that having some block limit is important when building IPLD tooling, since the fixed size allows for various assumptions and technological simplifications that otherwise could not be made.

Underlying both the security and tooling arguments is the implicit argument that it is helpful for users and our ecosystem that data be as transportable as possible, and that it would be a shame if one set of IPFS applications chose a 2MiB limit while another chose 10MiB, since the 10MiB data wouldn’t be compatible with the 2MiB-limited applications.

Solving Block Limits - Data Transfer Security

As described in more depth in the presentation on this (slides), as well as in an early proposal, for a number of common use cases we can deal with the security issues at the data transfer layer by leveraging some of the properties of common hash constructions.

Merkle-Damgård Constructions

These include common hash functions like SHA1/2/3. In short, if you are able to assume freestart-collision resistance rather than only collision resistance (which appears safe for SHA2/3, and SHA1’s collision resistance is in any event problematic) then you can download even a 100GB block in an incrementally verified way if you download it backwards.

The downloading-backwards process starts from the end of the block: each chunk is verified against an already-trusted hash state, and in turn establishes trust in the state that precedes it.
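As a rough sketch of a single verification step (not the actual proposal’s wire format), the provider serves the block last-chunk-first, and each chunk comes with the claimed intermediate hash state that precedes it. The Go below assumes a hypothetical sha256Compress helper, since Go’s standard library does not expose the raw SHA-256 compression function.

// Sketch of one backwards-verification step for a Merkle-Damgård hash such as
// SHA-256. sha256Compress is a hypothetical helper exposing the raw
// compression function; it is not part of the standard library.

type chunk struct {
	claimedState [8]uint32 // claimed SHA-256 chaining value before this chunk
	data         []byte    // chunk bytes, a multiple of the 64-byte block size;
	                       // the final chunk also carries the length padding
}

// verifyStep checks that compressing forward from the claimed state over the
// chunk lands exactly on a state we already trust (initially the state implied
// by the root digest, i.e. the CID). Freestart-collision resistance is what
// makes it safe to then trust claimedState as the target for the next
// (earlier) chunk.
func verifyStep(trusted [8]uint32, c chunk) (next [8]uint32, ok bool) {
	state := c.claimedState
	for off := 0; off+64 <= len(c.data); off += 64 {
		state = sha256Compress(state, c.data[off:off+64]) // hypothetical helper
	}
	if state != trusted {
		return next, false // bogus chunk: discard it and penalize the peer
	}
	return c.claimedState, true // earlier chunks now verify against this state
}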

Merkle Tree Constructions

These include some newer hash functions like Blake3. In short, just as a UnixFS file is incrementally verifiable due to being a merkle tree, so too can you construct a hash function that is itself a merkle tree and therefore also incrementally verifiable.
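As a generic illustration (this is not the exact Blake3 construction, which additionally binds chunk counters and domain-separation flags), verifying one chunk of a large block against a tree-hash root only requires the sibling digests along its path:

// Generic merkle-tree hash verification sketch using crypto/sha256 for the
// node hashes; illustrative only, not Blake3's actual tree mode.

import "crypto/sha256"

func hashLeaf(chunk []byte) [32]byte { return sha256.Sum256(append([]byte{0x00}, chunk...)) }

func hashNode(l, r [32]byte) [32]byte {
	return sha256.Sum256(append(append([]byte{0x01}, l[:]...), r[:]...))
}

// verifyChunk checks that chunk is the leaf at position index in the tree
// committed to by root, given the sibling digests from the leaf level upward.
func verifyChunk(root [32]byte, chunk []byte, index uint, siblings [][32]byte) bool {
	h := hashLeaf(chunk)
	for _, sib := range siblings {
		if index&1 == 0 {
			h = hashNode(h, sib) // we are the left child at this level
		} else {
			h = hashNode(sib, h) // we are the right child at this level
		}
		index >>= 1
	}
	return h == root
}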

Solving Block Limits - IPLD Tooling

The argument that IPLD is even a reason for us to have block limits is new to me. The first time I heard this argument was at IPFS Thing 2022. Overall the idea is that it makes creating IPLD tooling more complicated and increases the probability that some data will work nicely in some IPFS implementations but not in others.

Historically having block limits be a part of IPLD has been rejected (e.g. in Max node size limitations · Issue #48 · ipld/ipld · GitHub and the maximum block size should be specified · Issue #193 · ipld/specs · GitHub), so this argument either implies that there should be changes to the IPLD specs or that there should be an IPFS-wide block limit due to IPLD tooling while still not having a block limit in IPLD itself which seems strange.

Below are the highlights of the impacted IPLD components:

Codecs

Supporting IPLD Codecs on blocks >2MiB will have some pain, but it’s nothing new:

  • Painful: Instead of being able to work with the serialized data all at once, it has to be handled in pieces, which might not be possible in certain environments for certain formats. This could lead to some data being readable in some contexts but not in others.
    • For example, building a streaming JSON decoder that can get useful information out of a single 100GB object might be painful so my DAG-JSON implementation might decide to only support handling blocks up to 2MiB here so as to be both simpler to implement and not run out of memory during processing. This means some IPFS implementations would be able to process some DAG-JSON data, but not others.
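As a minimal sketch of the streaming approach mentioned above (using Go’s encoding/json token API purely for illustration; a real streaming DAG-JSON decoder would need much more than this, and encoding/json still buffers individual string tokens):

// Pull one top-level field out of a potentially huge JSON object by scanning
// tokens, without decoding the whole document into memory.
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// skipValue discards the next JSON value token-by-token so that large nested
// values are never materialized as Go objects.
func skipValue(dec *json.Decoder) error {
	depth := 0
	for {
		tok, err := dec.Token()
		if err != nil {
			return err
		}
		if d, isDelim := tok.(json.Delim); isDelim {
			if d == '{' || d == '[' {
				depth++
			} else {
				depth--
			}
		}
		if depth == 0 {
			return nil
		}
	}
}

func findTopLevelField(r io.Reader, field string) (string, error) {
	dec := json.NewDecoder(r)
	if _, err := dec.Token(); err != nil { // consume the opening '{'
		return "", err
	}
	for dec.More() {
		keyTok, err := dec.Token()
		if err != nil {
			return "", err
		}
		if key, _ := keyTok.(string); key == field {
			var val string
			if err := dec.Decode(&val); err != nil {
				return "", err
			}
			return val, nil
		}
		if err := skipValue(dec); err != nil {
			return "", err
		}
	}
	return "", fmt.Errorf("field %q not found", field)
}

func main() {
	doc := `{"name":"example","payload":"...imagine many MiB of data here..."}`
	fmt.Println(findTopLevelField(strings.NewReader(doc), "name"))
}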

This problem of some data being readable in some contexts but not in others is not new:

  • Not all codecs (or even hash functions) are implemented in every IPFS implementation
  • For any codec where you would be concerned about decoding a 100GB block, there could be an ADL that does the same across a graph of a million 1MB blocks and would fit the current model

It’s also the kind of thing we automatically have to deal with in a world of remote dynamically loaded codecs and ADLs proposed in some of the WASM + IPFS integration proposals.

  • Any sort of dynamic loading involves running code from an untrusted source in a sandbox with some kind of resource limiting. If the resource limiting is not a globally agreed upon number then we end up with the same issues as having per-peer block limits, which is not particularly different from no block limits at all
    • While we could agree on global resource limits here, thus far none of the dynamic loading proponents I have spoken with are in support of having global resource limits

This is also an area that we can move slowly on and only expose pieces as we need them. Most large blocks tend to just be piles of bytes rather than more structured data. This means that if there were concerns here we could limit the scope and decide either that:

  1. The only IPLD codec that can apply to a large block is the raw codec which refers to plain bytes.
  2. The only IPLD codecs that can apply to a large block are ones whose decoding results in streamable bytes. This is similar to the above, but also allows for codecs that are simple transformations over bytes.

Data Storage

To support large blocks, it is likely that block storage systems will need to differentiate between large and small data, especially if there are any transport-specific optimizations they’ll want to pre-compute locally.

For example, whether for Blake3 or for a large SHA256 block, the block storage may want some level of indirection: one mapping from multihash → the collection of chunks to load to reconstruct the data, and another from multihash → the information needed to efficiently make the data available to the transport layer (a rough sketch of such a record follows the list below).

This is somewhat unfortunate, however:

  1. Many implementations already have these types of indirection layers in their key-value stores
    1. Boost and lotus have multihash → set of CAR files + offsets for the block
    2. Kubo’s filestore/urlstore has multihash → the location, offset, and range where the block might live within a file or URL
  2. It seems very likely the transport level optimizations will start to appear in data storage anyhow
    1. Git has pack-files for their trusted transfer protocol
    2. Synchronization protocols like dsync and ones based on invertible bloom filters will likely track the collections of blocks they are synchronizing
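As a rough sketch of what such an indirection record might look like (all names here are hypothetical and only loosely modeled on the Boost/lotus and filestore examples above, not an existing API):

// Hypothetical shape of a blockstore indirection layer for large blocks:
// instead of multihash -> raw bytes, the store keeps multihash -> a manifest
// describing where the pieces live and what the transport layer needs.
// All names here are illustrative, not an existing IPFS API.

type ChunkRef struct {
	Location string // e.g. a CAR file path, plain file path, or URL
	Offset   uint64 // byte offset of this piece within Location
	Length   uint64 // byte length of this piece
}

type LargeBlockManifest struct {
	Multihash   []byte     // the multihash the whole block is addressed by
	TotalSize   uint64     // total size of the reassembled block in bytes
	Chunks      []ChunkRef // pieces to concatenate, in order, to rebuild the block
	VerifierAux []byte     // transport-specific helper data, e.g. intermediate
	                       // SHA-256 states or a Blake3 outboard tree, if precomputed
}

// A large-block-aware blockstore might then expose something like:
type LargeBlockstore interface {
	GetManifest(multihash []byte) (*LargeBlockManifest, error)
	ReadRange(multihash []byte, offset, length uint64) ([]byte, error)
}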

Data Transfer

More so than with storage, the cost of getting started building a new generic IPFS data transfer protocol increases as the block limit increases.

While not every IPFS implementation even supports the same set of hash functions, the added complexity to support large blocks of data increases what it takes to build a new highly compatible IPFS implementation.

Groups that have so far chosen to rely on only a single data transfer protocol will have to start adding support for new data transfer implementations if they want to support large blocks.

Alternatives - Just leave it alone

The alternative to removing the block limit generally ends up being to just leave it at 2MiB. Sure, we could increase it to 4, 10, 1000, etc., but either way you end up running into the same sorts of tradeoffs that have already been mentioned around security, ease of implementation, resource consumption, helping users make the “right decisions™”, etc. Since we have to choose an arbitrary number, sticking with the one we already have seems reasonable.

By leaving things as they are we continue to be in a world where most tooling in the IPFS ecosystem is unable to deal with a lot of the content addressable data that is already out there in places like:

  • Programming language package managers
  • Operating system package managers
  • Docker container registries
  • Blockchain storage systems like Arweave and Sia that chose larger block limits
  • Git
  • BitTorrent
  • …

While IPFS tooling can work with these formats when individual blocks are small and that’s sufficient for some use cases, there’s a whole lot of large block data that’s out there and sometimes the small percentage of something like Git repos or BitTorrent files that are not compatible is enough to hurt the compatibility story more generally.

The major alternatives then for interoperability frequently look something like:

  1. Have a trusted map of SHA256 of a 1GB Ubuntu ISO → UnixFS representation of that data and use that trusted map for lookups
    1. This requires a trusted map which is generally not what we want to do with our self-certifiable content-addressed data
  2. Convince people distributing graphs with large blocks that they should also/instead distribute graphs with small blocks
    1. While 1MB blocks seem generally better than 100MB blocks, convincing people to change their patterns is difficult. Why not let them see some of the benefits of interoperability with the rest of the IPFS ecosystem and then let them change things to get additional benefits, rather than forcing them to do all or nothing, or to use something like a trusted mapping?

Removing Block Limits - Let’s do it!

We are now at a point with block limits where the question is no longer if we can do it, but should we. I think the answer here is yes! Unlocking interoperability with other content-addressed systems, where we safely can, seems like a big step forward that’s worth the additional cognitive overhead and technical complexity of needing to handle it.

While I don’t think building new systems with large blocks as the smallest unit of data is a particularly good idea, due to issues like the complexity of building efficient streaming parsers, data transfer protocols, etc., I’d like to see us build out compatibility even with those who disagree, while also providing documentation and examples to help them understand why and how to make better data structures in the future.

For example, while there is now a proposal for a way to safely download a 1GB SHA256 block, it’s poorly suited for things like video because the data must be downloaded backwards and there is no way to seek into the middle and start without reading everything from the end back to that point. Helping people understand the tradeoffs and alternatives to make their own choices seems better than dictating our own.

The underlying premise of keeping block limits around in the ecosystem is that we will need to build more sophisticated tooling in order to deal with these large blocks which will end up increasing the complexity of new IPFS implementations that wish to be compatible with existing ones. While this type of compatibility and ease of implementation is admirable I think restricting what users can do in order to make them do the “right” thing isn’t really the “style” of the IPFS ecosystem.

We like to be the open tent of content addressing that says it’s fine to bring whatever data transfer protocol you want, whatever content routing system you want, or whatever content addressable data structures you want and still be part of the IPFS ecosystem. As it is today, not every IPFS implementation has tooling to work with all types of data equally.

If I had to choose a dividing line in the sand here around what types of data should be “in bounds” for IPFS, it would be difficult. However, I think it would be that it should be possible for other implementations, running on a variety of platforms and with a variety of threat models, to all be able to get and process that data. Just because they can get and process the data doesn’t mean they have to.



Yes… except 10 MiB would still be better than 2 MiB for many things. So it’s a oneliner improvement that could happen right away. Low hanging fruit. Final solution? No, but should probably be done anyways.


Wouldn’t checking the advertised sha256 before allowing it to be pinned somewhat protect against a DoS attack? Sure, I had to waste a bunch of bandwidth downloading a 1G file that was bogus, but I’m not going to reprovide that content. Then the attacker is the only one providing the bogus content, opening themselves to a bigger DDoS attack. If I download, verify, and pin the content, that would be like a vote that it’s legit. Sure, I could set up a large number of bogus providers, but I could also set up a large number of bogus providers that poison a large number of small blocks too. It seems like the only protection is that legit providers drown out bogus providers. As far as I know there isn’t any filtering of nodes that provide bad blocks.

It would be like punching yourself in the face because it made your adversary uncomfortable.

For an npmjs → IPFS gateway, a larger block size would allow more content to be migrated; a package needs to fit into one block because the IPFS CID is computed from the npmjs sha256 hash. For Filecoin integration, larger blocks will speed up transfers. Torrents can have blocks from 64kB to 16MB.

Probably 4 MB blocks will be ok.

In the three comments so far I’ve seen proposals for 10MiB, 4MiB, and unlimited non-incrementally verified blocks.

From what I can tell none of these have any particular justification for why the number was chosen, or what use cases it unlocks that were not there before. Increasing the block limit effectively hoists that requirement onto everyone, which in a non-consensus distributed ecosystem isn’t something you can just do every year; it should take into account the variety of IPFS use cases and platforms (IoT, satellites, laptops, phones, server farms, etc.) and avoid making data untransferable across platforms.

That being said, if there were some compelling use case and stats for bumping the magic number up to a different magic number, that seems like something that could be addressed and pushed for; I just haven’t seen any so far.

For example, what’s the magic that 4MiB or 10MiB brings to the ecosystem that makes it worthwhile to bump the number to there but not further?


Not really. The DoS attack can occur at multiple levels: for example, wasting bandwidth, and storing the partial block in “temporary space” (whether that’s memory or disk) while awaiting authentication.

Your claim that resource usage is symmetrical and that DDoS attacks are essentially inevitable even with a block limit of 100B since the adversary can advertise millions of peers that have that block is both true and not.

  • Yes, the adversary can get you to waste bandwidth and connections if they can put tons of bogus records in the content routing system.
    • However, clients can choose which peers to fetch data from and how much to trust the content routing system. That kubo doesn’t have anything fancy here on the client side doesn’t mean we should be hamstrung at the protocol layer.
    • For example, there have been recent proposals and interest in extending go-bitswap to allow for better selection and filtering of peers on the client side, which could do all sorts of things (e.g. grow the amount of data they’re willing to download from a peer based on past behavior, penalize peers that send data we haven’t recently asked [them] for, limit data coming from IP ranges that haven’t proven historically useful …). See Add peer scoring / reputation tracking · Issue #577 · ipfs/go-bitswap · GitHub for a start on the issue; a rough sketch of such a per-peer ledger follows this list.
  • No, bandwidth attacks aren’t the only problem. If you have to download a 100GB file and get all of it before it is verified, you either have to store that 100GB in RAM (which could OOM and kill the process) or be forced to store it all on disk
    • If you wanted to go the temp file route rather than allowing OOMing there’s a bunch of complexity to worry about on the implementation layer. For example, using streaming protobuf decoding in Bitswap implementations, figuring out what to do if the scratch space for temp files is too small to complete either of two competing large blocks, etc.
    • There are also other problems that come from non-incrementally verifiable large blocks (e.g. difficulty resuming the download of an interrupted large block, problems downloading from peers in parallel, etc.)
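Purely as an illustration of the kind of client-side ledger described above (none of this is an existing go-bitswap API), the policy knobs might look like:

// Hypothetical per-peer ledger for client-side fetching policy: grow the
// allowance of unverified in-flight data with good behavior, shrink it
// sharply on bogus or unsolicited data. A real implementation would also
// want decay over time, a floor, IP-range heuristics, etc.

type peerLedger struct {
	verifiedBytes    uint64 // data from this peer that verified correctly
	wastedBytes      uint64 // data that failed verification or was never asked for
	outstandingLimit uint64 // how much unverified data we allow in flight
}

func newPeerLedger() *peerLedger {
	return &peerLedger{outstandingLimit: 1 << 20} // modest initial 1 MiB allowance
}

func (l *peerLedger) onVerified(n uint64) {
	l.verifiedBytes += n
	if l.outstandingLimit < 1<<30 { // cap growth at 1 GiB in this sketch
		l.outstandingLimit += n / 2
	}
}

func (l *peerLedger) onBogus(n uint64) {
	l.wastedBytes += n
	l.outstandingLimit /= 4
}

// allow reports whether another n bytes should be requested from this peer
// while inflight unverified bytes are already outstanding.
func (l *peerLedger) allow(inflight, n uint64) bool {
	return inflight+n <= l.outstandingLimit
}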

Perhaps I’ve missed something and all these issues are easier to resolve than they appear. If so, propose some changes for how they’d be handled. If handling infinite-sized non-incrementally-verifiable blocks in an untrusted p2p network turns out to be easy, then it seems like something IPFS implementations should support.


Agreed, although part of me wonders about the value brought to ecosystems like npm as a function of the percentage of data that can be transferred over IPFS. IIRC 2MiB covers a lot of packages already and 4MiB wouldn’t cover them all. Does that incremental move to 4 provide a lot of value or do we need a way to cover all of them to really get the added value beyond 2MiB blocks?

I don’t think larger blocks necessarily implies faster speedups. Look at BitTorrent. As you mentioned BitTorrent v1 has 64KiB to 16MiB blocks, however BitTorrent v2 has 16KiB blocks in the merkle-tree. If you want speed increases here there’s plenty to do at the data transfer protocol and implementation layers such that the block limit seems like a distraction.


I mean, if you make a new default, it is like 20x fewer DHT announcements per MB, a 20x reduction of keys in datastores, less bitswap contention, better utilization of the streams, smaller memory footprint when indexes are memory mapped, etc… it sounds like it does bring lots of tangible improvement, so it’s more a question of what are the downsides and from which number do they become unacceptable?

Now, there is no magic number, but we should not be holding back a one-liner improvement because it is not a total solution. eMule used 9MB piece sizes back in the modem dial-up days. We can always take the response size averages from the ipfs gateways, see what the usual size of content is, and see whether 4MiB would be better than 10MiB, etc.

Perhaps we should put this discussion in its own thread, as it is derailing the real topic (my bad :frowning: ).

I think the point of this discussion is that we cryptographically can do whatever size we want and still verify incrementally.

We don’t need a limit if we implement @adin’s proposal (you compute from the end, the IV is seeded by the remote node, and you verify that after X bytes of data you recover the expected state (either the hash itself for the first chunk or the IV of the previous download)). (Actually, other pieces of the stack like IPLD probably still need a limit, so huge blocks will probably be raw-only for a while.)

The “X is bigger and better, less per-block overhead, and protocol Y already uses big blocks so it’s fine” argument is one that never ends.

Sounds like a solution right there, and it would move it from peer-to-peer to friend-to-friend like you mentioned previously.

You’ve already committed to downloading that 100GB file so you’re presumably going to need to be able to handle a file of that size. If it’s bogus it’s wasted effort but your intention is that it should succeed.

Seems like it all boils down to a trust issue.

Larger blocks lead to a speedup: there is a considerable (about 3x) speed increase (measuring wall clock) using 1M blocks instead of the default 256 kB.

The speed increase will diminish. It will be something like:

1M block: 3 times faster than the default
2M block: 3.5 times faster
4M block: 3.8 times faster

This needs to be measured in different network conditions before deciding on block size.

I’ve been playing around with an idea that I haven’t completely thought out but I thought I’d share to get people’s feedback.

Is there a possibility that IPFS could support dynamic block sizes with the use of a composable hash like LTHash? With a composable hash, hash(a) + hash(b) = hash(ab). So say a and b are 1M blocks: I could request the larger 2M block hash(ab) and verify, before requesting it, that the larger block is the concatenation of the two smaller blocks. You could start at some higher level, say 4M blocks, get what you can, and then move progressively down to fill in what you need with smaller blocks.

No worries, it seems like a reasonable set of questions and you’re not the only one who will ask/bring them up. Can do it here or elsewhere :person_shrugging:. I do think that if we want to choose a new magic number, that should be a new thread. This one is about what to do when the new magic number ends up being too small (which it always will be). However, conceptually discussing whether this approach is viable (or if there are better alternatives) seems fine here. Criticism welcome :sweat_smile:.

I would say yes/no here.

For basically all of these, consider BitTorrent and how BitTorrent-v2 has smaller block sizes (and many BitTorrent-v1 clients defaulted to smaller block sizes) and yet they advertise less in their DHT, have reasonable resource utilization, and IIUC could work with small indices as well. BitTorrent has different guarantees and properties than IPFS does, but the point is that it’s not so much the block limit that is affecting things here as the protocols, data structures, and expectations/guarantees around them.

However, even in a world where we want to be able to address chunks of data inside a larger object independently, but want the minimal reference size to be larger than 2MiB, it seems to me that you’d still want incremental verifiability for parallelism and resource-consumption purposes. As mentioned above, you could do this with tree hashes.

Sure, it adds more complexity if, when downloading say a 50MB Blake3 blob, you need to also get some merkle-proof components and potentially even leverage an alternative protocol, but in exchange you can end up resolving a bunch of the issues here.

Breaking down some of the issues:

  • fewer DHT announcements per MB
    • No, kubo doesn’t make this particularly easy at the moment, but at the protocol layer there’s no reason why you couldn’t just advertise less. For example, if you wished to only advertise a single CID for 100MB of data, you could just advertise the root of the UnixFS file that references that 100MB of data.
    • There are some core tradeoffs here between having independently addressable chunks and minimizing the number of advertisements. In general the only use cases I’ve heard of for large blocks for new data are large files/binary blobs
    • To some extent the more core tradeoffs here come from:
      • you only need to advertise the independently namable things that people might want to reference.
        • e.g. “that scene in The Lion King where Rafiki holds up Simba” seems independently nameable, but minutes 10-15 of The Lion King really just falls under the name “The Lion King”
      • in an open distributed system like IPFS there’s no way to know in advance what people might want to reference so you may want to over-advertise
        • IPLD links only cover CIDs not paths (e.g. no linking to bytes 1M-2M of a 10MB block) which for some structures might encourage people to over-advertise
  • less bitswap contention/better utilization of the streams
    • Are you sure? This doesn’t seem accurate, and if true it is likely a manifestation of the current go-bitswap implementation rather than anything that’s associated with either the Bitswap protocol or block sizes in general.
  • reduction of keys in datastores/smaller memory footprint when indexes are memory mapped
    • This seems fair in that if you want small indices + Bitswap and all you’ve got is hundreds of MBs of encrypted data, you might decide that you want larger blocks at the expense of download parallelism. However, you could get both by utilizing a tree hash like Blake3 along with a download protocol slightly different from Bitswap that allowed for requests like “Give me the blocks satisfying the byte range X-Y in this Blake3 CID”
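To make that last request shape concrete, such a protocol’s messages might look something like the following (hypothetical shapes only; this is not Bitswap or any existing IPFS protocol):

// Hypothetical message shapes for a download protocol that can serve verified
// byte ranges out of one large tree-hashed (e.g. Blake3) block. It only
// illustrates that one advertised CID can still be fetched piecewise and
// verified incrementally.

type RangeRequest struct {
	Cid    []byte // CID of the large block, using a tree hash such as Blake3
	Offset uint64 // first byte wanted
	Length uint64 // number of bytes wanted
}

type RangeResponse struct {
	Data  []byte   // the requested bytes (possibly rounded out to chunk boundaries)
	Proof [][]byte // merkle nodes needed to verify Data against the CID's root
}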

What advantages are you seeing for using this rather than some merkle-tree based hash function such as Blake3?

Funny thing is that LTHash uses Blake2x in its implementation, but I have to look closer to see how it’s used. I haven’t worked it out exactly, but I have a vague intuition that it would allow you to subdivide blocks without storage overhead. If you store a file with 2M blocks and then store the same file with 1M blocks you’d have two copies. With a homomorphic hash you could verify that the 2M block is equivalent to the two 1M blocks. It would be nice if, when someone added a file with 1M blocks and then subsequently with 2M blocks, it could just say, “hey, I’ve already got all this info” and just add a pointer.

It would be really cool if you could do the same thing when someone chose a fixed-size block and someone else chose a content-based chunker: take the union of the splits but only store the data once, unlike now where I believe you’d most likely have two complete copies of the data.

I’ve looked at LTHash and it is extremely cryptographically unsafe.

Basically lthash("test1", "test2") == lthash("test2", "test1").
Also it doesn’t do: lthash(a) + lthash(b) = lthash(ab).

You can look at the implementation; it should be obvious why it’s completely unusable:

It is literally just summing the current sum vector as u16 with the next one, which is commutative and easy to fully roll.

See this in the readme also:

Be aware that LtHash is vulnerable to multiset input collisions. A multiset is a set containing more than one instance of a particular element. In particular, it is trivial to produce a collision in lthash16 by adding the same input to the hash 2^16 times. One way to prevent this is to concatenate each input with a unique piece of metadata, such as an index.

AFAICT lthash is meant for non-cryptographic purposes (comparing sets of trusted inputs).


I can think of a way to create a hash function such that hash(a) + hash(b) = hash(ab) without too much overhead (digestSize = sumSize * popcount((inputSize - 1) // blockSize + 1 if inputSize > 0 else 0)) (imagine a merkle-tree hash, but one that encodes incomplete parts of the tree into the final sum).
However it would have uncontrollably variable-sized outputs and wouldn’t hash trailing data (if the file size is not a multiple of the hasher blocksize).
Edit: Actually it would only work with chunks that are a multiple of the hasher blocksize, and you would have some alignment requirements to concatenate hashes.


Ok, thanks for looking into it. The original use was secure update propagation at Facebook: Homomorphic hashing for secure update propagation - Engineering at Meta

It does if a and b are sets and ab is the union.


I see, so it is used for comparing sets not digests (digests are the atoms of the set).
And as long as your set is smaller than 2^16 you should be fine?

Even assuming this is true we can’t use it for what we want because files aren’t unordered sets of blocks, the order is extremely important.

Which is why they recommend adding an index to the value. From their Github page

“trivial to produce a collision in lthash16 by adding the same input to the hash 2^16 times. One way to prevent this is to concatenate each input with a unique piece of metadata, such as an index.”

There are also 20- and 32-bit versions specified but not implemented.

Facebook’s paper further specifies lthash20 and lthash32; these may be added in the future.

The scheme is specified in a paper from Bellare and Micciancio titled “A New Paradigm for Collision-free Hashing: Incrementality at Reduced Cost” https://cseweb.ucsd.edu/~daniele/papers/IncHash.pdf

I’m not saying it’s a slam dunk for use with IPFS or how exactly it would be used but it’s certainly interesting and seems like something that might be useful somewhere. I was hoping that someone would look at it and say, “oh, I know just what to do with that”.

TL;DR

  • Do NOT impose a fixed size on blocks. Allow blocks of different sizes because different blocks have different uses and therefore different optimal sizes.
  • A soft-limit of 2MiB on block sizes is pretty good actually
  • Separate the notion of an object from that of a file
  • Introduce an object layer between blocks and IPLD
  • Have IPLD address objects and not blocks
  • Treat objects as a sequence of fixed size blocks roughly equivalent to a unix file but without timestamps or access controls

Introduction

I am delighted to have found this discussion. I was working with IPLD, reading all the documentation, and found myself disturbed that the semantics around the boundary of IPLD objects were unclear w.r.t. size. I wanted to be able to create an arbitrary, structured data object and save it, getting back in return some sort of immutable object address that I could later use in secondary structures such as B-tree or HAMT indexes. What I found instead was a loose restriction on object size that, from the perspective of object creation, had no semantic relationship to the object itself.

Like others, I started the process of thinking about alternatives (hacks). That is, until I found this delightful thread.

The question from the thread is “should we increase the IPFS block size to support larger IPLD objects?”. I think my answer is “no”. And the reason is that we need a layer of abstraction between blocks and IPLD objects.

Lessons from Filesystems

My thought before I found this discussion was to look into what various filesystems do to handle large files. There are some general patterns across ext2/3/4, zfs, and others.

The general pattern (favoring ext2) seems to be to have three types of objects:

  • superblock - structure to define the overall parameters of the filesystem
  • inode table - block or blocks (ext2) containing map from inode numbers to physical locations on disk or links to inode blocks
  • inodes - index entries inside the inode table organizing a set of blocks into a single file
  • inode blocks - single, doubly, or triply linked off the inode expanding the number of blocks that can be referred to by an inode
  • block - the data itself

What’s nice about this structure is that it clearly distinguishes between the concept of a file or an object (zfs parlance), and a block of storage.

I think what has gotten a bit mixed up in the design of IPLD on IPFS is that IPFS is the block layer, not the file or object layer. IPLD is most analogous to a file of JSON data (assuming cbor encoding), not a disk block. In this analogy, a large JSON file might be composed of hundreds or thousands of blocks that, to the higher, filesystem layer appears as a unified thing composed of a single stream or buffer of data. Within that single buffer is, in the case of a JSON document, a single object. Note that once we have a handle to the object, we stop caring about the blocks. We no longer think about whether a given string, or array, or buffer of bytes is aligned along block boundaries.

What is needed is a layered approach where, at a low level, we can chunk up and reference a large binary object composed of blocks, and a higher level where we reference objects.

IPFS does have the UnixFS structure that puts filesystem concepts on top of IPFS blocks. This is definitely a step in the right direction.

For reference, here is the protobuf schema for UnixFS:

message Data {
    enum DataType {
        Raw = 0;
        Directory = 1;
        File = 2;
        Metadata = 3;
        Symlink = 4;
        HAMTShard = 5;
    }

    required DataType Type = 1;
    optional bytes Data = 2;
    optional uint64 filesize = 3;
    repeated uint64 blocksizes = 4;
    optional uint64 hashType = 5;
    optional uint64 fanout = 6;
    optional uint32 mode = 7;
    optional UnixTime mtime = 8;
}

message Metadata {
    optional string MimeType = 1;
}

message UnixTime {
    required int64 Seconds = 1;
    optional fixed32 FractionalNanoseconds = 2;
}

Criticisms of UnixFS

Filesystem oriented

Though I want to use Unix filesystem types as a point of comparison, IPLD does not naturally correspond to a structure stored in files and directories. When reasoning about the structure of the object I do not care where it is stored on disk. In fact, I would rather not think about location at all. What I care about is identity. Some underlying storage mechanism - e.g. an ARC cache with semantics to fetch requested but not cached objects - could return objects by identity without involving the caller in storage layout.

Permissions model implies a trusted service

The mode field is intended to capture the -rwxrwxrwx file permission bitmask common to unix file systems. There is no natural way that this maps to IPLD as there is no universal set of users and groups. Plus, most Unix systems augment the simple mode with ACLs while other systems apply object capability models (e.g. WNFS). It would be better to leave permissions as a concern external to block storage.

Time can be forged

The mtime field in UnixFS is similar to the atime (access time), ctime (creation time), and mtime (modification time) concepts in ext2. These are natural desires in filesystems but perhaps should be left to the object structure in something like IPLD.

If we want to accomplish an untrusted system, we would be wise to avoid the use of timestamps. Timestamps are often used as an input to rule execution. For example, consider an RBAC system that tries to determine whether an agent has permission to modify a given object. The agent may have the permission for a given time range. In a local-first distributed system, all an agent needs to do to reclaim the permission to write to a given object is to forge a modification timestamp that is in the past and within the given window.

The entire notion of time in a distributed system needs some careful thought. Be careful! For the purpose of an object store, I would recommend avoiding it entirely. Merkle-clocks are already much more useful as a history of modifications.

Metadata

In the context of a file system, this might be a good idea as it has a natural mapping to the MIME types in HTTP. However, most file systems do not store data this way and manage to work fine without it. Lots of file types use magic numbers for example.

Protobufs

Protobufs are generally deserialized with an object deserializer and thus require an O(n) operation. They do not permit random access. This might be fine for inode pointers and inode blocks but is not ideal for large objects. It would be preferable for object data to be able to be serialized with any serializer chosen by the user but especially serializers that permit random access such as capnproto or flatbuffers. From a protocol standpoint, the raw data of objects should be treated as arbitrary byte data. Note that the Data portion of the Protobuf schema above implies that even raw data blocks are Protobuf encoded if only to mark them as Raw.

HAMT

UnixFS uses a HAMT to list the blocks included in a file. The way this is implemented does not permit random access to blocks within the file. A better implementation would be a complete k-ary tree variant with fixed size blocks stored in order.

Note I assume because it’s called a HAMT that it’s organized by hash prefix. If I’m wrong about that, please correct me.

Design Goals

Random Access

If we are to consider truly large (huge) objects, we may not be able to download and store that object locally. The object itself may have internal structure permitting the calling system to index by location into the object and interact with a small piece of the data. We need random access.

Block Aligned

To construct flexible and fast query structures on top of IPLD, it would be nice for random access into IPLD data to be fast and work well with either the underlying filesystem or possibly with a raw block storage device. Implementors should choose IPFS/IPLD block sizes that either match the underlying block size or are even multiples of the underlying block sizes.

Content Addressed

We want to keep the content-addressed nature of IPFS along with all of its power.

We want to distinguish between object references and block references.

Correct Level of Abstraction

As IPLD developers, we do not want to think about object size, we want to think about object structure and semantic content.

Design Proposals

Variant 1

Construct a metadata structure similar to inode pointers and inode blocks to assemble a series of data blocks into a single object. Construct the structure in such a way that random access including random access downloading is possible. Following the lead of filesystems, maintain a complete k-ary tree of metadata blocks with links to data blocks.

For the metadata around the object, use Protobufs. Protobufs are fast, efficient, and supported in a huge array of languages. When used for small data, the lack of random access (needing to deserialize the whole message) is not a significant challenge.

For the raw data, treat it as a completely opaque byte array which gets broken into blocks. Choosing a serialization format is left entirely to the implementor.

Keep the structure as simple as possible. Omit metadata that implies rules that are controversial or could be specified elsewhere. Specifically, omit:

  • references to time including creation time, modification time, or (worst!) access time
  • references to file and/or data type (e.g. mime type, content type, file extension)
  • access control such as references to users or groups

Do NOT specify a maximum or fixed block size for the underlying system(s). It seems to me that the flexibility associated with having no fixed size is actually useful. The current soft-limit of 2MiB seems acceptable and there even seems to be some benefit to smaller block sizes. With this proposal, the implementation is free to download the ObjectNode and BlockTree and fetch many data blocks simultaneously. However, I do think that the ObjectNode should be restricted to a size like 1KiB. Typical choices for filesystem inode/dnodes are 128B to 512B. ObjectNode needs to be larger due to the large size of the CID addresses. Blocks containing either BlockIndexNodes or data nodes should be the current recommended max block size of 1MiB.

// Binary encoding of a CID
message Link {
	bytes CID = 1;
}

// A node in the n-ary Block Tree containing either (internal) references (CID)
// to other BlockIndexNodes or (leaves) references to data blocks.
// It is recommended to allocate BlockIndexNodes with the same size as 
// data blocks. Sizing and padding are left to the calling system.
message BlockIndexNode {
	// Link to either the data block or another level of BlockIndex nodes
	// Whether the node points to another BlockIndexNode or a data block
	// can be inferred from the calculated height of the tree.
	repeated Link BlockLink = 1;
}

// The ObjectNode is similar to an inode in Unix filesystems in that 
// it provides the starting point for locating and loading the data blocks
// that make up the object. The structure is created with the expectation that
// it will be stored in a fixed size block (e.g. 1KiB) where that size is
// specified externally. Padding to fill the block is left to the caller or
// to the filesystem on which this object is written.
message ObjectNode {

	// The total size of the object (just the data) in bytes (max 16EiB)
	uint64 Size = 1;
	
	// The (max) size of data block (max 2GiB). All blocks are expected
	// to be filled except for the final block. Padding the final data block
	// is left to the caller or to the underlying filesystem
	uint32 BlockSize = 2;

	// The total number of data blocks
	uint64 BlockCount = 3;

	// If BlockTree is not null, sets the number of links per level 
	// of the BlockTree
	optional uint32 Fanout = 4;

	// If links to data blocks cannot fit in DataBlocks,
	// this field stores a reference to the root of a BlockTree of type 
	// BlockIndexNode
	// NOTE: Mutually exclusive with DataBlocks and Data
	optional Link BlockTree = 5; 

	// If the number of Links to data blocks is small enough that they will fit
	// entirely in the ObjectNode, store them in order in this array. (An empty
	// list means this field is unused; protobuf does not allow a repeated field
	// to also be marked optional.)
	//
	// NOTE: Mutually exclusive with BlockTree and Data.
	repeated Link DataBlocks = 6;

	// Variant 2 - store raw data in ObjectNode if there is enough
	// room left in the block size to store all the data
	// NOTE: mutually exclusive with BlockTree and DataBlocks
	optional bytes Data = 7;

}

// NOTE: there is no schema for the raw data itself since the data can
// be serialized with any serialization schema chosen by the implementer
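As a sketch of how a reader might use the complete, in-order k-ary layout above for random access (illustrative Go, assuming Fanout ≥ 2 and BlockSize > 0; not part of any existing implementation), turning a byte offset into a data block index plus the child indices to follow from the root:

// blockPath maps a byte offset to the data block holding it and the child
// index to follow at each BlockIndexNode level from the root down.
// Assumes blockSize > 0, blockCount >= 1, and fanout >= 2.
func blockPath(offset, blockSize, blockCount, fanout uint64) (block uint64, path []uint64) {
	block = offset / blockSize // which data block the offset falls in

	// Number of index levels needed so that fanout^levels >= blockCount.
	levels := uint64(0)
	for span := uint64(1); span < blockCount; span *= fanout {
		levels++
	}

	// Leaves covered by each child of the root: fanout^(levels-1).
	span := uint64(1)
	for i := uint64(1); i < levels; i++ {
		span *= fanout
	}

	// Walk from the root, picking the child subtree that covers `block`.
	for l := uint64(0); l < levels; l++ {
		path = append(path, (block/span)%fanout)
		span /= fanout
	}
	return block, path
}

For example, for a 10 GiB object with BlockSize = 1 MiB and Fanout = 1024, byte offset 5 GiB lands in data block 5120, reached from the root via child indices [5, 0].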

Criticisms of Variant 1

The biggest criticism I can see is that, for small objects (and most objects might be small), the entire object could fit into a single block even if that block is small. For example, lots of database records fit easily into a single 1KiB structure.

Due to the use of CIDs and large ints for BlockCount and BlockSize, we can’t use a small size for ObjectNode like the 128B - 512B common in filesystems.

The use of a repeated section for links is nice because the CIDs do not have a definite length. However, protobuf does not make it quick to randomly seek to an arbitrary array index. The fields are prefixed with a length so we do not have to scan the entire object. Fast enough?

The mutual exclusivity of the DataBlocks, Data (Variant 2), and BlockTree fields could be a little confusing. I think it is a nice feature that these extra fields permit addressing both huge and tiny data.

There is no padding on any of the data structures. It is assumed that implementations will pad these structures. For example, if ObjectNodes are included in some sort of ObjectNode table with fixed-size records, they will be padded to the record length. For BlockIndexNodes and data nodes to be block-aligned in the underlying storage, it is assumed that they will be automatically padded when copied into underlying filesystem blocks by the filesystem. Are we certain this is the case? Do we want to apply padding to the block size for each?

Variant 2

To address the small object dilemma, add an optional field in the ObjectNode for raw data only used when the raw data can fit entirely inside the ObjectNode.

References

[1] Cap’n Proto, FlatBuffers, and SBE
[2] UnixFS Spec
[3] The Second Extended File System
[4] ZFS On-Disk Specification
[5] Is there a provably optimal block (piece) size for torrents (and individual files)?
[6] File System (WNFS)
[7] a bittorrent filesystem

I am strongly in favor of increasing block size. Currently there is some momentum for serving scientific data using IPFS. IPFS is actually a perfect fit for scientific data, which is archived and distributed through very clunky repositories, DOIs, etc., all things that are solved very elegantly by IPFS.

One major benefit IPFS has over most repositories is that with those repositories you have to download the data before you can use it (I know of only one that offers an S3-compatible object store to access the data), but with IPFS you can actually use the data directly without explicitly downloading it. The issue here is performance: because all of the new cloud-ready data formats, such as Zarr, already come chunked, any additional chunking just produces unnecessary overhead.