How to measure how many blocks are shared between CIDs?

meermanr · June 8, 2024, 12:49pm

I’d like to use a private IPFS swarm at work to distribute large disk images (~6-10 GiB) between build systems in the cloud and lab equipment in our various offices. I believe that the content of these disk images is largely identical between builds (perhaps 80% unchanged), and so hope that IPFS will save us a lot of network transfers and reduce both time and cost to distribute these images.

I want to maximise deduplication between these similar images, but I’d also like to compress them. I noticed that zstd has an --rsyncable option and lets you configure the block size (with -B), so I am hoping that if I match IPFS’s default chunk size of 256 KiB I might be able to have my cake and eat it.

How can I tell how many blocks are shared between two (or more) CIDs? Are there any tools which can help with this, that can cope with such large images?

For context and completeness, this is the method I’ve been using to date, but I’ve little confidence in this approach. Please point out anything I am missing.

ipfs refs ${CID_A} > a.txt
ipfs refs ${CID_B} > b.txt
cat a.txt b.txt | sort | uniq -c | sort -rn | head

Which gives me results like this:

# cat a.txt b.txt | sort | uniq -c | sort -rn | head
 164 bafyb4igsmsho23lsnstheptnxehtrsxag6dejrleftw3mx4dpkjeyw5xvq
   2 bafyb4iew4jo7ncdmfg2omzswignrflkj2nekx6b43uwwtew47m252frj7u
   2 bafyb4icj2iqw3aew5vp6ap4dnfqbvhirdsipanlm3h7q72waelcsoliiiu
   2 bafyb4ibtdmszworl3g4pf5n5gpinsb6lrppgea3xpryoofwjzdcykpteha
   2 bafyb4iar4rrgkliooqvsdihh5vs3cin2pwv4jqrdun7yor26pmysiylw2a
   1 bafyb4ihynsbm6w6mjlcdanpxl5wfhypvbe4tb7khdknui3f2dh2tpuqwsq
   1 bafyb4ihyfusgjxh2ztwi4dfwli6eafqj4yrtnj5tugkeot6vecndustl7u
   1 bafyb4ihxaivofhs7tqz2vyebxtcjb75d6jv7pazyiekzl2ht3wusucanau
   1 bafyb4ihqmz4hcdgwcbghjusx4fgf6pdoq7ptn2oteh4flvtqsw5zztuhs4
   1 bafyb4ihpylrhg3aesspl2v6ume3k2b7p2ps554t3paduu2pmas3duou36y

This gives me a rough indication, but I have my doubts about converting this into, say, a percentage of the two files that overlap. I mean, each of these refs could be a subtree, right?

The way I’ve been checking the size of a block is to run ipfs cat and read the total size from its progress bar:

# ipfs cat bafyb4igsmsho23lsnstheptnxehtrsxag6dejrleftw3mx4dpkjeyw5xvq > /dev/null
 43.50 MiB / 43.50 MiB [==========================================================] 100.00% 0s
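
One possible way to weight refs by bytes instead of by count, as a minimal sketch: ipfs cat reports the file data under the whole subtree of a CID, whereas ipfs block stat reports the size of that single block only. The helper name sum_block_sizes is made up for illustration, and the sketch assumes every listed block can be fetched by the local node.

```shell
# Hypothetical helper: read one CID per line on stdin and print the
# summed raw size (in bytes) of those blocks.
sum_block_sizes() {
    total=0
    while read -r cid; do
        # `ipfs block stat` prints "Key: <cid>" and "Size: <bytes>"
        # for this one block, not for the subtree below it.
        size=$(ipfs block stat "$cid" < /dev/null | awk '/^Size:/ {print $2}')
        # Treat a failed lookup as 0 bytes (an assumption of this sketch).
        total=$((total + ${size:-0}))
    done
    echo "$total"
}
```

Feeding it the list of CIDs that occur in both dumps would give a byte figure for the shared blocks rather than a block count. It starts one ipfs process per CID, so it is likely to be slow on tens of thousands of blocks.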

So my recipe currently is:

# Dump all blocks referenced for original (old) file
# ipfs refs --recursive bafyb4ihlhd33yczkyehbvt5hoiyyao6m6reqiyqgendho2himmf4krhupq > a

# .. and new file
# ipfs refs --recursive bafyb4ibo7d2bywtaszu5q2pf25eat4akley4ygb7u2tb2gwxibdxtpq4ga > b

# How many blocks are there, total, in both files?
# cat a b | wc -l
   53876

# How many distinct blocks are there?
# cat a b | sort | uniq -c | wc -l
   14938

# .. and how many of them are unique (referenced exactly once)?
# cat a b | sort | uniq -c | grep '^ *1 ' | wc -l
    9178

# So what's the dedupe?
# (53876 - 9178) / 53876
0.8296458534412354
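
The recipe above counts every reference, including ones repeated inside a single file, so the ratio mixes within-file and cross-file duplication. A minimal sketch that separates the two, assuming a and b are the ipfs refs --recursive dumps from above; the function name shared_block_counts is made up for illustration. It counts blocks, not bytes, so an intermediate node and a full-size leaf weigh the same.

```shell
# Hypothetical helper: compare two ref dumps (one CID per line) and
# report how many distinct CIDs are exclusive to each and how many are shared.
shared_block_counts() {
    a_sorted=$(mktemp)
    b_sorted=$(mktemp)
    # De-duplicate each dump on its own, so repeats inside one file
    # are not mistaken for sharing between the two files.
    LC_ALL=C sort -u "$1" > "$a_sorted"
    LC_ALL=C sort -u "$2" > "$b_sorted"
    # comm needs sorted input: -23 keeps lines only in the first file,
    # -13 lines only in the second, -12 lines present in both.
    only_a=$(LC_ALL=C comm -23 "$a_sorted" "$b_sorted" | wc -l)
    only_b=$(LC_ALL=C comm -13 "$a_sorted" "$b_sorted" | wc -l)
    shared=$(LC_ALL=C comm -12 "$a_sorted" "$b_sorted" | wc -l)
    rm -f "$a_sorted" "$b_sorted"
    # $((...)) strips the padding some wc implementations add.
    echo "only_a=$((only_a)) only_b=$((only_b)) shared=$((shared))"
}
```

Calling shared_block_counts a b prints the three counts on one line; shared divided by (only_b + shared) would then be the fraction of the new image's distinct blocks that already exist in the old one.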

Is there a better way?

Jorropo · June 8, 2024, 3:06pm

ipfs dag stat ${CID_A} ${CID_B}
