UnixFS Object Overhead

UnixFS isn’t just a straight representation of whatever data you are adding. For example, if I’m adding a cat picture to IPFS as a UnixFS object, there will be “extra data” generated, namely metadata and object structure data (field names, links, etc.).

Is there any way to reliably determine what the overhead will be? For example, if I’m adding a 10 MB cat picture with a 256 KiB chunk size and the balanced layout, is there any way to extrapolate how much overhead will be generated in terms of extra data?

In practice it doesn’t seem like much, maybe 1-3%, but I’m wondering if anyone has thought this through more thoroughly.

You cannot determine the overhead exactly; it depends on the underlying blockstore (and, if flatfs is used, on the underlying filesystem).

Well, IPFS is in this respect similar to a regular filesystem: there’s always some overhead involved.

When storing files on a filesystem nobody cares about the overhead, as long as it doesn’t limit usability (like running out of inodes on extX).
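
To make the filesystem side of that concrete, here is a rough sketch (my own assumptions, not an exact model of flatfs): it treats each block as one file on disk and rounds every file up to a 4 KiB allocation unit, ignoring inode and directory metadata.

```go
// Rough sketch of filesystem-level overhead for a flatfs-style blockstore.
// Assumptions (not an exact model of flatfs): each block is written as one
// file, and the filesystem allocates space in 4 KiB units, so each file is
// rounded up to the next 4 KiB boundary. Per-file inode/metadata cost is ignored.
package main

import "fmt"

func main() {
	const (
		fileSize  = 10_000_000 // ~10 MB of data (example from the question)
		blockSize = 256 * 1024 // 256 KiB blocks
		fsUnit    = 4096       // assumed filesystem allocation unit
	)

	roundUp := func(n int) int { return (n + fsUnit - 1) / fsUnit * fsUnit }

	fullBlocks := fileSize / blockSize
	tail := fileSize % blockSize

	onDisk := fullBlocks * roundUp(blockSize)
	if tail > 0 {
		onDisk += roundUp(tail)
	}

	fmt.Printf("logical: %d bytes, on disk: %d bytes, slack: %d bytes\n",
		fileSize, onDisk, onDisk-fileSize)
}
```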

So UnixFS is a protobuf structure: https://github.com/ipfs/go-unixfs/blob/master/pb/unixfs.proto, which is usually (unless raw leaves are used, etc.) wrapped in a DAG-PB node (https://github.com/ipfs/go-merkledag/blob/master/pb/merkledag.proto).

You can do the math and figure out how big the overhead is from how big the protobuf objects will be (the data they carry, plus the size of the link names, the link CIDs, etc.).
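
As a very rough sketch of that math for the 10 MB / 256 KiB example above (the per-item byte counts here are my own approximations of the CID, link and UnixFS framing sizes, not exact DAG-PB encodings):

```go
// Back-of-envelope estimate of UnixFS/DAG-PB overhead for a small balanced
// DAG with a single root node. All per-item byte counts are rough
// assumptions, not exact protobuf encoding sizes.
package main

import "fmt"

func main() {
	const (
		fileSize  = 10_000_000 // ~10 MB file (example from the question)
		chunkSize = 256 * 1024 // 256 KiB chunks (the go-ipfs default)

		bytesPerLink        = 45 // assumed: CID + field tags + empty Name + Tsize varint
		bytesPerBlocksize   = 4  // assumed: one blocksize varint per chunk in the root's UnixFS data
		bytesPerLeafWrapper = 14 // assumed: DAG-PB/UnixFS framing per leaf (roughly 0 with raw leaves)
	)

	leaves := (fileSize + chunkSize - 1) / chunkSize
	rootOverhead := leaves*(bytesPerLink+bytesPerBlocksize) + 16 // +16: rough fixed root framing
	leafOverhead := leaves * bytesPerLeafWrapper

	total := rootOverhead + leafOverhead
	fmt.Printf("%d leaves, ~%d bytes of DAG overhead (~%.3f%% of the file)\n",
		leaves, total, 100*float64(total)/float64(fileSize))
}
```

With raw leaves the per-leaf wrapper term drops to roughly zero, so the DAG metadata ends up being just the link blocks, a tiny fraction of a percent at 256 KiB chunks.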

@RubenKelevra well, I think that would be true if storing stuff on IPFS were a 1:1 mapping onto a regular filesystem. But I don’t think that’s quite the case, unless you were just serializing your data into bytes and storing those bytes in a key-value store directly, without creating IPFS objects.

When storing files on a filesystem nobody cares about the overhead, as long as it doesn’t limit usability (like running out of inodes on extX).

Well, for example, let’s say there’s 1 KB of overhead for every 10 MB of data you store. If you store 10 GB, that works out to roughly 1 MB of “overhead”, and scaling this up to petabytes of data would have very real implications that I think certain people will care about.
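
Spelled out with a quick sketch (the ratio is purely the hypothetical number above, not a measured value):

```go
// Illustrative scaling of a fixed overhead ratio. The "1 KB per 10 MB"
// figure is just the hypothetical number from the post above, not a
// measured value.
package main

import "fmt"

func main() {
	const overheadRatio = 1.0 / (10 * 1024) // 1 KiB per 10 MiB, about 0.01%

	// roughly 10 GB, 1 TB, 1 PB of stored data
	for _, stored := range []float64{10e9, 1e12, 1e15} {
		fmt.Printf("%16.0f bytes stored -> ~%.0f bytes of overhead\n",
			stored, stored*overheadRatio)
	}
}
```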

@hector Ah perfect, that was exactly what I was looking for. Thanks!

Well, there’s no serialization going on (when you’re using raw leaves) and no key-value store involved when you’re using flatfs (the default).


This is very true, but there are so many ways to use IPFS, whether using go-ipfs directly, ipfs-lite, or even constructing your nodes programmatically via libp2p, that I think it’s good to at least be aware of the possible overhead.

@postables I am in the final stages of putting together a tool for de-duplication evaluation that can answer your question as well. It does all the work ipfs add would, but in memory and with a much more streamlined architecture, allowing it to run extremely fast. On sufficiently beefy hardware, over 3 GiB/s of ingestion is not out of the question.

Tentatively I should have a version one could install and run “real soon now”.

The output looks kind of like this (this is off my MacBook, hence the relatively low speed):

~$ zstd -qdck test/data/large_repeat_5GiB.zst | bin/stream-dagger --legacy-ipfs-add-command="--cid-version=1"
{"type":   "root",   "size": 5368709120, "stream":     0, "cid":"bafybeia3kyhmzicrlqrnkwuq2i3rh443d7mxgbxof276taxvbol7ae6zja" }

Performed 92,654 read() syscalls into 302 distinct buffers
Streaming took 5.777 seconds at about 886.32 MiB/s
Processed a total of:  5,368,709,120 bytes

Forming DAG covering:  5,369,740,447 bytes across 20,599 nodes
Dataset would occupy:    164,626,432 bytes over 628 unique leaf data blocks
Linked as streams by:      1,031,327 bytes over 119 unique DAG-PB link blocks
Taking a grand-total:    165,657,759 bytes, 3.09% of original, 32.4x smaller
Counts\Sizes:          3%       10%       25%       50%       95% |      Avg
{1}     1 L1:                                               6,147 |    6,147
      118 L2:       8,710     8,710     8,710     8,710     8,710 |    8,687
      628 DB:     262,144   262,144   262,144   262,144   262,144 |  262,144

Or with a modified linking strategy:

~$ zstd -qdck test/data/large_repeat_5GiB.zst | bin/stream-dagger --legacy-ipfs-add-command="--cid-version=1 --trickle"
{"type":   "root",   "size": 5368709120, "stream":     0, "cid":"bafybeibkk3ztvggbxaev5wcfggiphqaknwj7h656ojmhaxry4oacx6pqfe" }

Performed 96,909 read() syscalls into 302 distinct buffers
Streaming took 6.221 seconds at about 823.02 MiB/s
Processed a total of:  5,368,709,120 bytes

Forming DAG covering:  5,369,740,397 bytes across 20,598 nodes
Dataset would occupy:    164,626,432 bytes over 628 unique leaf data blocks
Linked as streams by:      1,031,277 bytes over 118 unique DAG-PB link blocks
Taking a grand-total:    165,657,709 bytes, 3.09% of original, 32.4x smaller
Counts\Sizes:          3%       10%       25%       50%       95% |      Avg
{1}     1 L1:                                               9,343 |    9,343
       12 L2:       8,710     8,710     8,710     8,918     9,127 |    8,914
       47 L3:       8,710     8,710     8,710     8,710     8,918 |    8,774
       58 L4:       8,710     8,710     8,710     8,710     8,710 |    8,665
      628 DB:     262,144   262,144   262,144   262,144   262,144 |  262,144

Thanks for the response! If you need some testing, I’ve got like 72 CPU cores I can throw at this when you have a v1 out :grin:

What you’re working on 100000% seems like exactly what I am looking for in terms of a tool for this kind of stuff, as well as other deduplication research. Great stuff :rocket: