Working with Shards to Manage 2TB Dataset

Hey There!

We are working on speeding up retrievals of Zarr data stored on IPFS. With a 2TB dataset, the manifest file of key-value pairs (chunk key → CID) ends up around 160MB, meaning you had to load 160MB just to start reading the dataset. That obviously wasn't feasible, so we introduced a HAMT structure that divides the manifest into “blocks” and “levels”, bringing the first fetch down to roughly 160MB / 256 = 0.625MB. You then have to traverse the levels, so it can take a couple of round trips (caching along the way), but that's much better than 160MB at once. This works really well for random data and arbitrary key-value pairs. With our latest async improvements we were getting around 8MB/s on big datasets (which need more round trips) and around 22MB/s on smaller datasets (fewer round trips). But there was room for more improvement.
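For anyone not familiar with the HAMT layout, here is a rough Python sketch of the idea; the key name, hash function, and level count are illustrative, not our exact implementation:

```python
# Conceptual sketch only: a HAMT with a fanout of 256 turns a chunk key into
# a short path of small blocks, one byte of the key's hash per level, so the
# root block holds at most 256 entries instead of the whole 160MB manifest.
import hashlib

FANOUT = 256  # one byte of the hash consumed per level

def hamt_path(key: str, levels: int = 3):
    """Return the bucket index to follow at each level for this key."""
    digest = hashlib.sha256(key.encode()).digest()
    return [digest[i] for i in range(levels)]   # each byte is already 0..255

# e.g. a Zarr chunk key like "precip/4.5.6" resolves to a few small block
# fetches (one per level) rather than one 160MB manifest read.
print(hamt_path("precip/4.5.6"))
```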

Because the data is structured through time, I had the idea to shard the key-value pairs through time, and then, because it's a predictable structure, we can drop the keys entirely, reducing the manifest size. That brings the 160MB down to roughly 80 × 2MB shards (maybe less, since no keys are required). And because it's predictable, once you load one shard you have all of the metadata in that time range, whereas the HAMT scatters it more randomly. This sharding doesn't work on random data though, which is why the HAMT is still very useful for other datasets.
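A rough sketch of the time-sharding idea (the numbers are placeholders, not the real dataset's layout); the point is that the mapping from a time chunk to its shard is a simple formula, so no per-chunk keys need to be stored at all:

```python
TIME_CHUNKS_PER_SHARD = 512   # assumption: time chunks covered by one ~2MB shard

def shard_for_time_chunk(time_chunk: int) -> int:
    """Index of the shard holding the metadata for this time chunk."""
    return time_chunk // TIME_CHUNKS_PER_SHARD

# Loading that one shard gives you every chunk CID in its time window, so
# nearby reads hit an already-cached shard instead of new round trips.
```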

This brings me to the issue. I convert the 3D chunk indices into 1D indices for the shard, so [4,5,6] converts down to index 180, for example. The shards are built like “baf…baf…baf…baf…”, with each CID in a fixed-width slot, which means I can easily byte-offset into them. If I want the CID for [4,5,6], I ask the gateway for the byte range at slot 180, meaning I only need to load that one “baf…” from the 2MB shard. In parallel I load the whole shard anyway, to avoid the round trip on future requests.
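Here is a minimal sketch of that lookup, assuming each chunk's CID sits in a fixed-width slot and the 3D chunk index is flattened row-major. The grid shape, slot width, and shard URL below are placeholders, not the real dataset's parameters (in the real layout [4,5,6] flattens to 180):

```python
import requests

GRID_SHAPE = (100, 20, 20)   # assumed chunk grid (time, y, x)
CID_WIDTH = 64               # assumed fixed byte width of one CID slot
SHARD_URL = "https://ipfs-gateway.dclimate.net/ipfs/<shard-cid>"  # placeholder

def flat_index(chunk, shape=GRID_SHAPE):
    """Row-major flattening: (t, y, x) -> t*ny*nx + y*nx + x."""
    t, y, x = chunk
    _, ny, nx = shape
    return t * ny * nx + y * nx + x

def fetch_chunk_cid(chunk):
    """Range-request only the one CID slot we need from the ~2MB shard."""
    start = flat_index(chunk) * CID_WIDTH
    end = start + CID_WIDTH - 1                   # HTTP byte ranges are inclusive
    resp = requests.get(SHARD_URL, headers={"Range": f"bytes={start}-{end}"})
    resp.raise_for_status()
    return resp.content.rstrip(b"\x00")           # trailing zeros = empty-slot padding

# The full shard can be fetched in the background at the same time and cached,
# so later lookups in the same time range skip the round trip entirely.
```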

In my brief tests on a small dataset with shards, I measured speeds of around 32MB/s. Because of the structure of the shards, this speed should theoretically also hold on 2TB datasets, where before we were getting the slower 8MB/s.

But we also want to make these datasets easily shareable. I want to be able to do ipfs pin add ba.... on the root of the shards and have it traverse the structure of all the shards and pin everything. But a shard that is just raw concatenated CIDs (“ba....ba....ba....”) is not natively traversable.

What are your thoughts on adding this support to IPFS?

You can inspect the structure here:
Top Level:
https://ipfs-gateway.dclimate.net/ipfs/bafyr4idii6rt5qlnj2c7z67t5g75ml3bs2woevmocu4fwijppk6ap2p3bm/
Shard (auto downloads):
https://ipfs-gateway.dclimate.net/ipfs/bafkr4ieg3unhzavj23wtionrpv4lkcdjge7ncapi5zy54gp26fsphufirm/

I can make the top level traversable, but not the shard level yet. You will see a lot of 000000s: the shard isn't full, and I zero-pad it to preserve the byte offsetting, so CIDs can be inserted anywhere and the structure still holds.
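To make the pinning question concrete, here is roughly what a client currently has to do by hand, since the shard is a raw byte blob rather than an IPLD node with real links (the slot width is the same assumption as in the sketch above):

```python
CID_WIDTH = 64  # assumed fixed byte width of one CID slot

def iter_shard_cids(shard_bytes: bytes, width: int = CID_WIDTH):
    """Yield the non-empty CID strings from a raw shard blob."""
    for offset in range(0, len(shard_bytes), width):
        slot = shard_bytes[offset:offset + width]
        if slot.strip(b"\x00"):              # skip the zero-padded empty slots
            yield slot.rstrip(b"\x00").decode()

# Native support would mean `ipfs pin add <root>` could walk this itself,
# instead of every consumer re-implementing the traversal client-side.
```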

Would IPLD selectors or Graphsync be useful here?
