I’d like to use a private IPFS swarm at work to distribute large disk images (~6-10 GiB) between build systems in the cloud and lab equipment in our various offices. I believe that the content of these disk images is largely identical between builds (perhaps 80% unchanged), so I am hoping that IPFS will save us a lot of network transfer and reduce both the time and the cost of distributing these images.
I want to maximise deduplication between these similar images, but I’d also like to compress them. I noticed that zstd has an --rsyncable option and lets you configure the block size (with -B), so I am hoping that if I match IPFS’s default block size of 256 KiB I might be able to have my cake and eat it.
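To make that concrete, this is roughly the pipeline I have in mind (untested, and the file name is just a placeholder; as far as I know zstd only honours --rsyncable in multi-threaded mode, and it may round small -B job sizes up to an internal minimum):

# Compress with rsync-friendly sync points, hoping they line up
# reasonably often with 256 KiB chunk boundaries:
zstd -T0 --rsyncable -B262144 disk.img -o disk.img.zst

# Add with the default fixed-size chunker spelled out explicitly:
ipfs add --chunker=size-262144 disk.img.zst

It may also be worth comparing this against one of the content-defined chunkers (--chunker=buzhash or --chunker=rabin), since a fixed-size chunker only dedupes data that stays aligned to 256 KiB boundaries.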
How can I tell how many blocks are shared between two (or more) CIDs? Are there any tools that can help with this, and that can cope with images this large?
For context and completeness, this is the method I’ve been using to date, but I’ve little confidence in this approach. Please point out anything I am missing.
ipfs refs ${CID_A} > a.txt
ipfs refs ${CID_B} > b.txt
cat a.txt b.txt | sort | uniq -c | sort -rn | head
Which gives me results like this:
# cat a b | sort | uniq -c | sort -rn | head
164 bafyb4igsmsho23lsnstheptnxehtrsxag6dejrleftw3mx4dpkjeyw5xvq
2 bafyb4iew4jo7ncdmfg2omzswignrflkj2nekx6b43uwwtew47m252frj7u
2 bafyb4icj2iqw3aew5vp6ap4dnfqbvhirdsipanlm3h7q72waelcsoliiiu
2 bafyb4ibtdmszworl3g4pf5n5gpinsb6lrppgea3xpryoofwjzdcykpteha
2 bafyb4iar4rrgkliooqvsdihh5vs3cin2pwv4jqrdun7yor26pmysiylw2a
1 bafyb4ihynsbm6w6mjlcdanpxl5wfhypvbe4tb7khdknui3f2dh2tpuqwsq
1 bafyb4ihyfusgjxh2ztwi4dfwli6eafqj4yrtnj5tugkeot6vecndustl7u
1 bafyb4ihxaivofhs7tqz2vyebxtcjb75d6jv7pazyiekzl2ht3wusucanau
1 bafyb4ihqmz4hcdgwcbghjusx4fgf6pdoq7ptn2oteh4flvtqsw5zztuhs4
1 bafyb4ihpylrhg3aesspl2v6ume3k2b7p2ps554t3paduu2pmas3duou36y
This gives me a rough indication, but I have my doubts about converting this into, say, a percentage of the two files that overlap. I mean, each of these refs could be a subtree, right?
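One way to check whether a given ref is itself a subtree, I think, is to ask for its children; a raw leaf block should list none:

# Count the children of one of the heavily-repeated refs; a non-zero
# count means it is an internal node, not a 256 KiB leaf:
ipfs refs bafyb4igsmsho23lsnstheptnxehtrsxag6dejrleftw3mx4dpkjeyw5xvq | wc -l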
The way I’ve been checking the size of a block is to use ipfs cat and read the total size from its progress output:
# ipfs cat bafyb4igsmsho23lsnstheptnxehtrsxag6dejrleftw3mx4dpkjeyw5xvq > /dev/null
43.50 MiB / 43.50 MiB [==========================================================] 100.00% 0s
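That 43.50 MiB is presumably the cumulative size of everything below that ref rather than the size of one block (it works out to 174 × 256 KiB, which I believe matches the default fan-out of 174 links per node). If that’s right, ipfs block stat (raw size of the single block) and, on recent Kubo versions, ipfs dag stat (cumulative size plus block count) would be more direct:

# Size of just this one block; small if it's an internal node:
ipfs block stat bafyb4igsmsho23lsnstheptnxehtrsxag6dejrleftw3mx4dpkjeyw5xvq

# Cumulative size and number of blocks in the subtree below it:
ipfs dag stat bafyb4igsmsho23lsnstheptnxehtrsxag6dejrleftw3mx4dpkjeyw5xvq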
So my recipe currently is:
# Dump all blocks referenced for original (old) file
# ipfs refs --recursive bafyb4ihlhd33yczkyehbvt5hoiyyao6m6reqiyqgendho2himmf4krhupq > a
# .. and new file
# ipfs refs --recursive bafyb4ibo7d2bywtaszu5q2pf25eat4akley4ygb7u2tb2gwxibdxtpq4ga > b
# How many blocks are there, total, in both files?
# cat a b | wc -l
53876
# How many distinct blocks are there?
# cat a b | sort | uniq -c | wc -l
14938
# .. and how many of them are unique (referenced exactly once)?
# cat a b | sort | uniq -c | grep '^ *1 ' | wc -l
9178
# So what's the dedupe?
# (53876 - 9178) / 53876
0.8296458534412354
Is there a better way?
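One refinement I’ve been considering, to account for refs being different sizes: take the intersection of the two recursive, de-duplicated ref lists with comm, then weight each block by its raw size. An untested sketch (assuming ipfs block stat keeps printing a "Size:" line, and accepting that one daemon round-trip per block will be slow at this scale):

# Recursive, de-duplicated ref list for each root CID
ipfs refs --recursive --unique ${CID_A} | sort > a.txt
ipfs refs --recursive --unique ${CID_B} | sort > b.txt

# Sum the raw sizes of the block CIDs read on stdin
sum_sizes() {
  while read -r cid; do
    ipfs block stat "$cid" | awk '/^Size:/ {print $2}'
  done | awk '{ total += $1 } END { print total }'
}

shared=$(comm -12 a.txt b.txt | sum_sizes)   # blocks present in both DAGs
all=$(sort -u a.txt b.txt | sum_sizes)       # every distinct block
echo "${shared} shared bytes out of ${all} distinct bytes"

I’d then read the dedupe as shared divided by all, i.e. the fraction of the distinct bytes that both images have in common, but I’d still prefer a more standard tool if one exists.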