As @lidel mentioned, there’s a lot of prior art on much of this in Supporting Large IPLD Blocks. I’d recommend reading it for context on both why the limits exist and what can be done about them.
Some highlights:
- In the IPFS context, blocks are defined as the pile of bytes that you put through a hash function to get out a digest that you can put into a multihash. They are, in general, the minimally sized addressable chunk of data in content-addressable systems. You could define “block” differently, but then the definition of “block size” would change too.
- If you’re operating in a trusted/semi-trusted environment rather than a p2p one, AFAICT there are no problems here and everything is already solved: just hash your bytes with your favorite hash function, then fetch the bytes and validate them to make sure there was no corruption.
- In p2p environments, clients need to be able to get some proof that the 1EiB of data they’re downloading is actually what they’re looking for before downloading it. You can build trust over time, use merkle proofs, zk proofs, etc., but if your client doesn’t need these proofs then you’re likely already in a trusted/semi-trusted environment and you’re done.
- Some hash functions, like Blake3, KangarooTwelve, and the BitTorrent-v2 piece hashes, already have pretty obvious merkle proofs to use.
- The most commonly requested hash functions in IPFS-land for large blocks have traditionally been SHA-2 and SHA-1. There is a proposal for how to handle proofs for those large blocks. Unlike merklized hash functions like Blake3 you can’t fetch arbitrary slices, but if you’re looking to safely fetch and validate your large block it should be fine.
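To make the SHA-2 point concrete, here’s a minimal Go sketch of the underlying idea (not mdinc’s actual wire format): because SHA-2 is a Merkle–Damgård construction, a “proof” can consist of the hash state at each chunk boundary, letting a client check a large blob chunk by chunk instead of buffering all of it. The chunk size and the state-comparison scheme here are illustrative assumptions.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding"
	"fmt"
)

// Illustrative chunk size; a real proposal would pick its own.
const chunkSize = 1 << 20

// checkpoints returns the serialized SHA-256 state after each chunk,
// plus the final digest. A provider could publish these as a "proof".
func checkpoints(data []byte) (states [][]byte, digest []byte, err error) {
	h := sha256.New()
	for off := 0; off < len(data); off += chunkSize {
		end := off + chunkSize
		if end > len(data) {
			end = len(data)
		}
		h.Write(data[off:end])
		st, err := h.(encoding.BinaryMarshaler).MarshalBinary()
		if err != nil {
			return nil, nil, err
		}
		states = append(states, st)
	}
	return states, h.Sum(nil), nil
}

// verifyChunk checks one chunk in isolation: resume from the state before
// the chunk, hash the chunk, and compare against the next checkpoint.
func verifyChunk(prevState, chunk, nextState []byte) (bool, error) {
	h := sha256.New()
	if prevState != nil {
		if err := h.(encoding.BinaryUnmarshaler).UnmarshalBinary(prevState); err != nil {
			return false, err
		}
	}
	h.Write(chunk)
	got, err := h.(encoding.BinaryMarshaler).MarshalBinary()
	if err != nil {
		return false, err
	}
	return bytes.Equal(got, nextState), nil
}

func main() {
	blob := bytes.Repeat([]byte("large block "), 500_000) // ~6 MiB stand-in
	states, digest, _ := checkpoints(blob)
	fmt.Printf("sha2-256: %x, %d checkpoints\n", digest, len(states))

	// Verify the second chunk without touching the rest of the blob.
	ok, _ := verifyChunk(states[0], blob[chunkSize:2*chunkSize], states[1])
	fmt.Println("chunk 1 verified:", ok)
}
```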
> What are your pain points with blocks? What use cases do you have that are currently suffering from size issues?
- The primary reason it’s really no fun to be stuck with small blocks is the inability to be compatible with other content-addressed data out there
- e.g. using IPFS tooling to address, find, and fetch the large SHA-2 blocks that are present in basically every package manager (see the sketch after this list)
- Other reasons (e.g. “I don’t want to advertise the middle blocks of zip files”, “I don’t want my database / index of multihash → bytes to include the middles of zip files”, …) are largely solvable without touching this problem directly. For example, in BitTorrent files can have individual hashes, but in practice they are referenced by the hash of the identifier of the “collection” of objects.
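As a hedged illustration of the interop point above, here’s roughly what addressing an existing package artifact with IPFS tooling could look like, assuming the usual go-multihash and go-cid libraries: hash the artifact with sha2-256 and wrap the digest in a raw-codec CIDv1 so it can be announced and fetched like any other block. The file name is made up.

```go
package main

import (
	"fmt"
	"os"

	"github.com/ipfs/go-cid"
	mh "github.com/multiformats/go-multihash"
)

func main() {
	// Hypothetical package artifact that a registry already publishes a
	// SHA-256 digest for (e.g. a .deb, wheel, or crate).
	data, err := os.ReadFile("example-package.tar.gz")
	if err != nil {
		panic(err)
	}

	// sha2-256 multihash over the whole artifact (one big "block").
	digest, err := mh.Sum(data, mh.SHA2_256, -1)
	if err != nil {
		panic(err)
	}

	// Wrap it in a CIDv1 with the raw codec so existing IPFS tooling can
	// announce, find, and fetch it by the hash the registry already has.
	c := cid.NewCidV1(cid.Raw, digest)
	fmt.Println(c)
}
```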
> What are cool things that you’re working on or using that help with big blobs?
See the linked discussion post (and also aschmahmann/mdinc on GitHub: tooling for incremental verification of Merkle-Damgård construction hash functions, e.g. SHA2/3). While these aren’t recent, I’ve been doing some occasional hacking in the space, and if there’s interest I’d be happy to revive, update, etc.
> What would you like to see happening in the world?
I’d like my Docker container layers, package manager dependencies, etc. that already have SHA-2 hashes in them to be dynamically discoverable and retrievable over p2p, so things don’t break if a given registry goes down. Similar support for a merklized hash function (e.g. Blake3, BitTorrent-v2 pieces, etc.) would be great too, since those support verifiable ranges.
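To sketch what that could look like (hedged: the flow is an assumption, not an existing feature), the digest a container registry already hands out can be re-expressed as a CID without re-hashing anything, and that CID could then be handed to a content-routing lookup to find p2p providers of the layer:

```go
package main

import (
	"encoding/hex"
	"fmt"
	"strings"

	"github.com/ipfs/go-cid"
	mh "github.com/multiformats/go-multihash"
)

// cidFromOCIDigest converts an OCI/Docker "sha256:<hex>" digest into a
// raw-codec CIDv1 without re-downloading or re-hashing the layer.
func cidFromOCIDigest(d string) (cid.Cid, error) {
	hexDigest, ok := strings.CutPrefix(d, "sha256:")
	if !ok {
		return cid.Undef, fmt.Errorf("unsupported digest: %s", d)
	}
	raw, err := hex.DecodeString(hexDigest)
	if err != nil {
		return cid.Undef, err
	}
	encoded, err := mh.Encode(raw, mh.SHA2_256)
	if err != nil {
		return cid.Undef, err
	}
	return cid.NewCidV1(cid.Raw, encoded), nil
}

func main() {
	// Example layer digest as it would appear in an OCI image manifest.
	c, err := cidFromOCIDigest("sha256:9f13e0ac480c7ea0e150d85f1d9566b747ba0504c9299e2c4d58e9e04a19f4b2")
	if err != nil {
		panic(err)
	}
	// This CID could then be fed to a content-routing lookup
	// (e.g. `ipfs routing findprovs <cid>`) to find providers of the layer.
	fmt.Println(c)
}
```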
> Whatever other thoughts you have!
The CAR format is fairly simple, but it has a number of pretty annoying failings that people have been doing their best to ignore or work around (e.g. the lack of an EOF marker makes it difficult to know whether an HTTP-streamed CAR file has cleanly terminated, the header has no meaningful description of the contained content, etc.). It’s also really very unprepared to handle sending proofs (e.g. BAO, etc.) unless they look like blocks (e.g. UnixFS merkle proofs). A sketch of the EOF issue follows below.
Given the existing work around shipping bundles of blocks in CARs, it seems likely that having a container format that supports these other types of proofs would be convenient for those who might want to validate the data being sent to them before downloading all of it. Note: again, if you don’t have this problem there’s very little left to do; just use CARs.
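Here’s a rough, stdlib-only Go sketch of the EOF complaint, assuming the CARv1 layout of a uvarint-prefixed header followed by uvarint-prefixed sections: truncation mid-section is detectable, but a stream cut exactly at a section boundary looks identical to a complete file. The file name is a placeholder.

```go
package main

import (
	"bufio"
	"encoding/binary"
	"errors"
	"fmt"
	"io"
	"os"
)

// scanCARv1 walks a CARv1 stream: a uvarint-length-prefixed header
// followed by uvarint-length-prefixed sections (CID + block bytes).
func scanCARv1(r io.Reader) error {
	br := bufio.NewReader(r)

	// Header: uvarint length, then that many bytes of dag-cbor.
	hdrLen, err := binary.ReadUvarint(br)
	if err != nil {
		return fmt.Errorf("reading header length: %w", err)
	}
	if _, err := io.CopyN(io.Discard, br, int64(hdrLen)); err != nil {
		return fmt.Errorf("reading header: %w", err)
	}

	sections := 0
	for {
		secLen, err := binary.ReadUvarint(br)
		if errors.Is(err, io.EOF) {
			// The stream ended at a section boundary. There is no EOF
			// marker in CARv1, so "complete" and "truncated exactly at a
			// boundary" are indistinguishable here.
			fmt.Printf("stream ended after %d sections (complete? unknowable)\n", sections)
			return nil
		}
		if err != nil {
			return fmt.Errorf("reading section length: %w", err)
		}
		if _, err := io.CopyN(io.Discard, br, int64(secLen)); err != nil {
			// Ran out of bytes mid-section: this truncation IS detectable.
			return fmt.Errorf("truncated mid-section after %d sections: %w", sections, err)
		}
		sections++
	}
}

func main() {
	f, err := os.Open("example.car") // hypothetical CAR file
	if err != nil {
		panic(err)
	}
	defer f.Close()
	if err := scanCARv1(f); err != nil {
		fmt.Println("error:", err)
	}
}
```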
It may also be interesting to consider where this type of work fits in alongside support for something like webseeds in IPFS and the recent UnixFS profiles work, since webseeds are effectively about separating the file bytes from the proof bytes and allowing them to be downloaded from different sources. The concepts of outboard and combined BAOs are somewhat similar (e.g. outboard BAO → webseeds-like, combined BAO → CAR-like).
(Note: not a separate post because Discourse tells me it wants single large posts.)
See the linked discussions; this is not about UnixFS, it’s about working in a peer-to-peer rather than a trusted or friend-to-friend environment. UnixFS enforces no such limit.
It seems to me that interop of some sort is a goal here. The CAR format is a pretty basic primitive that you’re looking to use for interoperability between Bluesky and a set of other IPFS storage providers that could handle that data.
CommP is a similarly interesting case. It operates like Blake3 or BitTorrent-v2 in that it’s a merkle tree. Currently, my understanding is that anyone who does full piece retrievals from Filecoin will not get an accompanying merkle proof, and that if they wanted one they’d currently have to come up with a new format similar to the Blake3 outboard and combined formats. If people are interested in making progress in this space, it seems like the Filecoin folks involved in the recent PDP work might have some thoughts (cc @zenground0).
I think this is either missing context or is not correct. The security concerns are tied to proofs. Doing ranged requests into large blobs doesn’t help with anything if we have no associated proofs or additional trust assumptions. However, the security concerns go away if we can get the proofs separately (e.g. webseeds / outboard-like) or in-band with the data, similar to how the Trustless Gateway specification’s CAR requests with entity-bytes allow fetching a byte range and the corresponding proofs.
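For reference, a minimal sketch of that kind of in-band request against a gateway that implements the Trustless Gateway CAR spec (IPIP-402); the gateway host and CID are placeholders, and exact parameter support depends on the gateway:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

func main() {
	// Placeholder gateway and CID; any trustless gateway that supports
	// IPIP-402 CAR requests should accept this shape of request.
	base := "https://example-gateway.tld/ipfs/bafybeigdyrzt5sfp7udm7hu76uh7y26nf3efuylqabf3oclgtqy55fbzdi"

	q := url.Values{}
	q.Set("format", "car")
	q.Set("dag-scope", "entity")       // only the blocks for this entity
	q.Set("entity-bytes", "0:1048575") // roughly the first 1 MiB of the entity's bytes

	req, err := http.NewRequest(http.MethodGet, base+"?"+q.Encode(), nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Accept", "application/vnd.ipld.car")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// The response is a CAR containing the blocks for the requested byte
	// range plus the blocks needed to verify them against the root CID.
	car, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	fmt.Printf("got %d bytes of verifiable CAR data (HTTP %d)\n", len(car), resp.StatusCode)
}
```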