@postables I am in the final stages of putting together a tool for de-duplication evaluation that can answer your question as well. It does all the work ipfs add
would, but in memory and with a much more streamlined architecture, which lets it run extremely fast. On sufficiently beefy hardware, ingestion rates over 3GiB/s are not out of the question.
Tentatively, I should have a version one could install and run “real soon now”.
The output looks roughly like this (this is from my MacBook, hence the relatively low speed):
~$ zstd -qdck test/data/large_repeat_5GiB.zst | bin/stream-dagger --legacy-ipfs-add-command="--cid-version=1"
{"type": "root", "size": 5368709120, "stream": 0, "cid":"bafybeia3kyhmzicrlqrnkwuq2i3rh443d7mxgbxof276taxvbol7ae6zja" }
Performed 92,654 read() syscalls into 302 distinct buffers
Streaming took 5.777 seconds at about 886.32 MiB/s
Processed a total of: 5,368,709,120 bytes
Forming DAG covering: 5,369,740,447 bytes across 20,599 nodes
Dataset would occupy: 164,626,432 bytes over 628 unique leaf data blocks
Linked as streams by: 1,031,327 bytes over 119 unique DAG-PB link blocks
Taking a grand-total: 165,657,759 bytes, 3.09% of original, 32.4x smaller
Counts\Sizes: 3% 10% 25% 50% 95% | Avg
{1} 1 L1: 6,147 | 6,147
118 L2: 8,710 8,710 8,710 8,710 8,710 | 8,687
628 DB: 262,144 262,144 262,144 262,144 262,144 | 262,144
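For anyone reading the summary lines for the first time, here is a minimal Python sketch of how the totals in the run above appear to relate to each other. It is not part of the tool. The variable names and the `summarize` helper are made up for illustration, and the numbers are copied from the output.

```python
# Figures copied from the run above; names are illustrative only.
original_bytes = 5_368_709_120   # "Processed a total of"
leaf_bytes = 164_626_432         # "Dataset would occupy"
leaf_blocks = 628                # unique leaf data blocks
link_bytes = 1_031_327           # "Linked as streams by"


def summarize(original, leaves, links):
    """Return (grand total, percent of original, shrink factor)."""
    total = leaves + links
    percent = round(100 * total / original, 2)
    ratio = round(original / total, 1)
    return total, percent, ratio


total, percent, ratio = summarize(original_bytes, leaf_bytes, link_bytes)
print(f"{total:,} bytes, {percent}% of original, {ratio}x smaller")

# Every unique leaf is a full 256 KiB chunk, matching the "DB" row.
print(leaf_bytes // leaf_blocks)
```

The first printed line reproduces the "Taking a grand-total" figures: unique leaf bytes plus link-block bytes, expressed against the size of the input stream.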
Or with a modified linking strategy:
~$ zstd -qdck test/data/large_repeat_5GiB.zst | bin/stream-dagger --legacy-ipfs-add-command="--cid-version=1 --trickle"
{"type": "root", "size": 5368709120, "stream": 0, "cid":"bafybeibkk3ztvggbxaev5wcfggiphqaknwj7h656ojmhaxry4oacx6pqfe" }
Performed 96,909 read() syscalls into 302 distinct buffers
Streaming took 6.221 seconds at about 823.02 MiB/s
Processed a total of: 5,368,709,120 bytes
Forming DAG covering: 5,369,740,397 bytes across 20,598 nodes
Dataset would occupy: 164,626,432 bytes over 628 unique leaf data blocks
Linked as streams by: 1,031,277 bytes over 118 unique DAG-PB link blocks
Taking a grand-total: 165,657,709 bytes, 3.09% of original, 32.4x smaller
Counts\Sizes: 3% 10% 25% 50% 95% | Avg
{1} 1 L1: 9,343 | 9,343
12 L2: 8,710 8,710 8,710 8,918 9,127 | 8,914
47 L3: 8,710 8,710 8,710 8,710 8,918 | 8,774
58 L4: 8,710 8,710 8,710 8,710 8,710 | 8,665
628 DB: 262,144 262,144 262,144 262,144 262,144 | 262,144
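The two runs can also be cross-checked against each other. The sketch below is not from the tool and makes one assumption: that the "nodes" figure counts every 256 KiB leaf reference in the stream plus the unique link blocks. Under that reading the reported numbers are consistent for both layouts. The dictionary keys and the `consistent` helper are hypothetical names.

```python
CHUNK = 262_144               # leaf size shown in the "DB" rows
STREAM_BYTES = 5_368_709_120  # "Processed a total of"

# Figures copied from the two outputs above.
runs = {
    "default": {
        "nodes": 20_599,
        "links_per_level": [1, 118],         # L1, L2
        "link_bytes": 1_031_327,
        "covered": 5_369_740_447,
    },
    "trickle": {
        "nodes": 20_598,
        "links_per_level": [1, 12, 47, 58],  # L1..L4
        "link_bytes": 1_031_277,
        "covered": 5_369_740_397,
    },
}


def consistent(run):
    """Check node count and covered bytes against the assumed reading."""
    leaves = STREAM_BYTES // CHUNK           # leaf references, not unique blocks
    link_blocks = sum(run["links_per_level"])
    nodes_ok = leaves + link_blocks == run["nodes"]
    bytes_ok = STREAM_BYTES + run["link_bytes"] == run["covered"]
    return nodes_ok and bytes_ok


for name, run in runs.items():
    print(name, consistent(run))
```

Both layouts reference the same 628 unique leaf blocks, so the only difference in the grand total comes from the link blocks: 50 bytes and one block fewer with --trickle.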