Working with Shards to Manage 2TB Dataset

Hey There!

We are working on speeding up our retrievals with IPFS on Zarr data stores. With a 2TB dataset, the manifest file of key-value pairs grows to around 160MB, meaning that just to read the dataset you first need to load 160MB. This was obviously not feasible. So we introduced a HAMT structure that divides this manifest into “blocks” and “levels”, making the first fetch only 160MB / 256 ≈ 0.625MB. You then have to traverse the levels, so it might take a couple of round trips (caching along the way), but that is much better than 160MB at once. This works really well with random data and key-value pairs. With our latest async improvements we were getting around 8MB/s on big datasets (which need more round trips) and around 22MB/s on smaller datasets (fewer round trips). But there was room for more improvement.
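
For readers unfamiliar with the idea, here is a rough sketch of how such a hashed lookup narrows the first fetch; the fanout, hash function, and key format below are illustrative, not the exact layout of our HAMT.

import hashlib

FANOUT = 256  # buckets per node (assumed; matches the /256 above)

def bucket_path(key: str, levels: int):
    """Return the bucket index consumed at each level of the HAMT walk."""
    digest = hashlib.sha256(key.encode()).digest()
    # each level consumes one byte of the hash, so a client fetches one small
    # node per level (~1/256 of the manifest) instead of the whole 160MB
    return [digest[i] % FANOUT for i in range(levels)]

print(bucket_path("precip/4/5/6", levels=2))  # two small node fetches, then the value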

Because the data is structured through time, I had the idea to shard the key-values through time; and because it's a predictable structure, we can drop the keys entirely, reducing the manifest size. This brings the 160MB down to roughly 80 shards of ~2MB each (maybe less, since no keys are required). And because it's predictable, once you load one shard you have all the metadata in that time range, compared to a HAMT which is more randomized. This sharding doesn't work on random data though, which is why the HAMT is still very useful for other datasets.
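
As a toy illustration of why the keys can be dropped: with a fixed number of chunks per shard, a chunk's location is pure arithmetic (the chunks-per-shard value below is only illustrative, borrowed from later in this thread).

CHUNKS_PER_SHARD = 6250  # illustrative shard size

def locate(time_chunk_index: int):
    """Map an absolute time-ordered chunk index to (shard number, slot within shard)."""
    shard_no, slot = divmod(time_chunk_index, CHUNKS_PER_SHARD)
    return shard_no, slot

print(locate(100_000))  # -> (16, 0): fetch shard 16, read slot 0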

This brings me to the issue. I convert 3D chunk coordinates into a 1D index for the shard, so [4,5,6] converts down to index [180], for example. The shards are built like “baf…baf…baf…baf…”, which means I can easily byte-offset into them: if I want the CID for [4,5,6], I ask the gateway for the byte range at index 180, so I only need to load that one “baf…” entry from the 2MB shard. In parallel I load the whole shard anyway, to avoid round trips for future requests.
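
A minimal sketch of that lookup. It assumes the shard is a flat concatenation of fixed-width ASCII CIDs, that the chunk grid covered by the shard has trailing dimensions of 6x6 (so [4,5,6] lands at index 180), and that the gateway honours HTTP Range requests; only the shard CID is real, everything else is an assumption.

import urllib.request

GATEWAY = "https://ipfs-gateway.dclimate.net/ipfs/"
SHARD_CID = "bafkr4ieg3unhzavj23wtionrpv4lkcdjge7ncapi5zy54gp26fsphufirm"
CID_WIDTH = 59          # width of one ASCII-encoded CID slot (assumed)
GRID = (80, 6, 6)       # chunks covered by the shard along (t, y, x) -- assumed

def flat_index(t: int, y: int, x: int) -> int:
    # row-major flattening: [4, 5, 6] -> (4*6 + 5)*6 + 6 = 180
    return (t * GRID[1] + y) * GRID[2] + x

def fetch_cid(t: int, y: int, x: int) -> str:
    start = flat_index(t, y, x) * CID_WIDTH
    req = urllib.request.Request(
        GATEWAY + SHARD_CID,
        headers={"Range": f"bytes={start}-{start + CID_WIDTH - 1}"},
    )
    with urllib.request.urlopen(req) as resp:
        slot = resp.read().decode("ascii")
    return slot.rstrip("0")  # '0' never appears in base32 CIDs, so this only strips filler

print(fetch_cid(4, 5, 6))  # ~59 bytes over the wire instead of the whole 2MB shard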

In my brief tests on a small dataset with shards, I found speeds of around 32MB/s. Because of the structure of the shards, this speed should theoretically also hold for 2TB datasets, where before we had the slower 8MB/s.

But we also want to make these datasets easily shareable. I want to run ipfs pin add ba.... on the root of the shards and have it traverse the structure of all the shards and pin everything. But ba....ba...ba... is not natively supported.

What are your thoughts on adding this support to IPFS?

You can inspect the structure here:
Top Level:
https://ipfs-gateway.dclimate.net/ipfs/bafyr4idii6rt5qlnj2c7z67t5g75ml3bs2woevmocu4fwijppk6ap2p3bm/
Shard (auto downloads):
https://ipfs-gateway.dclimate.net/ipfs/bafkr4ieg3unhzavj23wtionrpv4lkcdjge7ncapi5zy54gp26fsphufirm/

I can make the top level traversable, but not the shard level yet. You will see a lot of 000000s. That's because the shard isn't full; I did this to preserve byte offsetting, so CIDs can be inserted anywhere and the structure is maintained.

Would IPLD selectors or Graphsync be useful here?

Hi, thanks for sharing this writeup of your needs and use case. I'm tagging @lidel, a Kubo maintainer who has been thinking about range requests recently, and @vmx as an IPLD expert.

@philvms mind clarifying what the specific ask is?

Do you mean the ability to have better sharding controls in Kubo?
Or adding support for multiple unrelated roots being passed to ipfs pin add in one batch?
Or improving retrieval performance of a 2TB sharded dataset?
Or improving the ergonomics of sharing sharded data, where users do not want the entire 2TB?

Below are some thoughts on everything you’ve touched upon, lmk if I missed your point.

On pin add accepting multiple roots

ipfs pin add, just like virtually all other commands, accepts a single CID/DAG.

If you want to pin multiple DAGs, you call it multiple times:

for cid in cid1 cid2 cid3; do ipfs pin add "$cid"; done

Following the Unix philosophy of small composable commands, it is better for users to handle calling it multiple times and decide on the batching and error handling that fits their use case, than for Kubo maintainers to make opinionated choices and be responsible for more code/bugs.

If I missed the point and there is more user benefit here, feel free to file a feature request via https://github.com/ipfs/kubo/issues/new/ so we can triage it properly.

On pinning your manually created dag-cbor and shards

Kubo's ipfs pin add -r operates on a CID, and you can pin an entire DAG recursively (-r) as long as the root CID you pin can be traversed.

Your example looks.. fine?

I think if your goal is to have an “easily shareable subset of shards” then you can either

  • ask users to pin the root of each shard separately (see the bash one-liner above)
    • you can also include the top-level root by pinning it without doing the recursive walk (ipfs pin add --recursive=false cid)
  • or create a dag-cbor manifest for the subset of data that links to the shards you want, and recursively pin that single root CID (see the sketch below).
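
A rough sketch of that second manifest option (the shard CIDs below are placeholders): emit a dag-json document whose values are IPLD links, import it as dag-cbor with ipfs dag put, then pin the printed root.

# Emit a dag-json manifest linking to the shards you want to share, then:
#   python make_manifest.py | ipfs dag put --input-codec dag-json --store-codec dag-cbor
#   ipfs pin add -r <CID printed by the previous command>
import json

shard_cids = [              # placeholders -- substitute the real shard CIDs
    "bafkr4i...shard-a",
    "bafkr4i...shard-b",
]
manifest = {"shards": [{"/": cid} for cid in shard_cids]}  # {"/": ...} marks an IPLD link
print(json.dumps(manifest))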

On Graphsync

If the goal is interop with the wider ecosystem, probably not?

  • Graphsync seems to no longer be actively maintained; most of the IPFS stack removed the experimental support for it (e.g. Kubo, IPNI).
  • IPLD Selectors are not exposed in end-user tooling due to complexity and risks. Only basic “selector-like” knobs are exposed, such as ipfs pin's --recursive=[true|false] or IPIP-402's dag-scope and entity-bytes.

Of course you can still choose to use Graphsync if you control both client and server, but YMMV with fixing issues, and it won't be supported by the majority of third-party peers on IPFS Mainnet.

If you are looking for something future-proof, it likely means sticking with codec-agnostic block-based protocols like Bitswap or HTTP Retrieval (over HTTP/2 for multiplexing).
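
For completeness, block-level HTTP retrieval needs nothing beyond a plain HTTP client. A sketch against the gateway from your post (the Accept header requests the verifiable raw block, as defined by the trustless gateway spec):

import urllib.request

cid = "bafkr4ieg3unhzavj23wtionrpv4lkcdjge7ncapi5zy54gp26fsphufirm"
req = urllib.request.Request(
    f"https://ipfs-gateway.dclimate.net/ipfs/{cid}",
    headers={"Accept": "application/vnd.ipld.raw"},  # ask for the raw block bytes
)
with urllib.request.urlopen(req) as resp:
    block = resp.read()  # hash these bytes and compare against `cid` before trusting them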

PS. Manual vs built-in automatic sharding and trade-offs

Directory/index sharding (be it manual, like the one you did with dag-cbor and raw blocks, or automatic, like the one provided by UnixFS' HAMT directories with dag-pb) always creates some type of DAG built out of “blocks”.

There is no silver bullet when it comes to the “optimal DAG shape”, and if the default is not good enough, a specific use case needs to figure out the best settings for its users (trading DAG width vs. the number of blocks/bytes a user needs to fetch to find data in the DAG vs. the number of blocks to change and re-fetch on leaf updates).

Happy you've found an optimized, time-based, manual dag-cbor approach that works for your use case. Just make sure each dag-cbor block does not get bigger than 1-2MiB so it remains transferable over Bitswap.

Not sure if useful, but perhaps for someone reading this in the future it may also be good to look at Kubo's recently added Import.Unixfs* config options, which allow adjusting the width/height of DAGs automatically created by ipfs add (including big, automatically chunked files and automatic HAMT-sharding of big UnixFS directories).

Thanks for the response!

Almost what I meant. You stopped right where I wanted to take it to the next level. The example I gave, bafkr4ieg3unhzavj23wtionrpv4lkcdjge7ncapi5zy54gp26fsphufirm, is actually just “bafk…bafk…bafk…bafk” in the raw data.

But going back into the details, I realized I was ASCII-encoding the CIDs, so each takes 59 bytes instead of the more efficient 36 bytes of the binary form; dag-cbor would make each take around 40 bytes. In the end I just lowered the shard size to 6250 CIDs with dag-cbor encoding. This allows more parallelization, and this way ipfs pin add will recursively pin everything. It would still be cool to allow packing of CIDs with dag-cbor support. From my research, dag-cbor has some leading bytes to indicate the length and doesn't really support filler CIDs, which prevents us from doing byte offsets (which I suppose is fine for now).

An example of a dagcbor-packed layout would be “bafy…bafy…bafy…bafy…bafy0000…bafy…bafy…”, where bafy0000… is a null value of 36 bytes, making it completely predictable how to compute the byte offset if I want, say, the 500th CID out of a list of 5000.
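
For anyone reading along, a rough sketch of that packed layout (the 36-byte width is the binary CID size from above; the helper names are made up):

CID_WIDTH = 36              # binary CID size mentioned above
FILLER = bytes(CID_WIDTH)   # all-zero placeholder for empty slots

def pack(cids_by_slot: dict, total_slots: int) -> bytes:
    """Concatenate fixed-width binary CIDs in slot order, zero-filling the gaps."""
    slots = [FILLER] * total_slots
    for i, cid in cids_by_slot.items():
        assert len(cid) == CID_WIDTH, "every CID must share the same fixed width"
        slots[i] = cid
    return b"".join(slots)

def cid_at(shard: bytes, slot: int):
    """Read one CID by pure byte offset; None means the slot is still empty."""
    raw = shard[slot * CID_WIDTH:(slot + 1) * CID_WIDTH]
    return None if raw == FILLER else raw

# e.g. the 500th CID of a 5000-slot shard always lives at bytes 18000..18035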
