IPFS and deduplication

Someone asked me a question about using IPFS for deduplication that had me scratching my head a little bit. Deduplication gets mentioned a lot when discussing IPFS, and I understand how it works. Their question was along the lines of, “I have two sets of images and I want to create a set with the duplicates removed.” Sure, you could just say, “Just add it to IPFS and it will be deduplicated,” but it’s the storage that will be deduplicated; there will still be multiple references. I think what they’re asking for is deduplicated references, and while that wouldn’t be terribly difficult, I couldn’t think of any simple way to do it.

Are there any easy ways to do this? Seems like ipfs object diff might be a possibility.
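(If ipfs object diff does work for this, I’d guess the usage is roughly the following, with placeholder folder CIDs; it should list the links that differ between the two folders.)

ipfs object diff /ipfs/QmFolderA /ipfs/QmFolderB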

It happens automatically when two blocks have the same hash. You rarely have to do anything.

So the first time it sees the image, it hashes the blocks, doesn’t find them in the blockstore, and copies them in.
The second time, it hashes the blocks, sees they are already stored, and does nothing.

Note this only works at the block level: files first need to be chunked into blocks, and chunking has different parameters. For example, if you add the same image twice but with different block sizes (one at 256 KiB, the other at 1 MiB), the copies won’t be deduplicated; there are options in the chunking process that control this.
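You can see the effect without writing anything to the blockstore by hashing the same file with different chunkers. A rough sketch (image.png is just a placeholder, and the file needs to be bigger than a single chunk for the difference to show):

# Same file, same chunker: same CID both times, so the blocks would dedupe.
ipfs add --only-hash --chunker=size-262144 image.png
ipfs add --only-hash --chunker=size-262144 image.png

# Same file, 1 MiB chunks: different block boundaries, so a different CID
# and no shared blocks with the 256 KiB version.
ipfs add --only-hash --chunker=size-1048576 image.png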

That’s what I meant by “under the hood”, but say someone comes along and says, “I put together 5 separate image datasets, but I’m pretty sure there are a lot of duplicate images. Same images with different names. I want to combine them into a single image dataset without the duplicates. I heard IPFS does deduplication, and I plan on putting the data in there anyway. Can I use IPFS to do the deduplication?”

When I say without the duplicates, I mean a list or directory containing all the unique files from the image sets.

Sure, I could throw together a quick bash script to do it, but I couldn’t think of a great way to do it in IPFS itself.
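Something like this is what I had in mind (a rough, untested sketch; QmFolderA and QmFolderB are placeholder folder CIDs, and it assumes file names without spaces):

# List both folders (ipfs ls prints CID, size, name per entry),
# keep the first name seen for each CID, and print the unique pairs.
{
  ipfs ls QmFolderA
  ipfs ls QmFolderB
} | awk '!seen[$1]++ {print $1, $3}'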

Names are not part of files; they are part of folders.
Files are just the data, and folders are just a list of names + child CIDs.
Kinda like this:

[
  {"name":"a.png","hash": "Qmfoo"},
  {"name":"b.png","hash": "Qmfoo"},
  {"name":"c.png","hash": "Qmfoo"}
]

For example, this is roughly what the inside of a folder looks like (in reality there is a bit more metadata and it’s protobuf, not JSON). It has 3 files, a.png, b.png and c.png, and since they all have the same CID, the data is deduplicated.

Got it, but there can be multiple files with different names and the same CID. What I’m asking is how the user would create a directory or list of all the unique CIDs. I.e. I’ve got everything in IPFS, some in folder A and some in folder B. Now I want to make a folder C with all the unique CIDs. I know plenty of ways to do it, but they all seem a bit hackish, and I was wondering if there was some elegant way to do it in IPFS.

oh you want an elegant way :smiley:

Try MFS

ipfs files --help

This is a virtual space where you can “edit” folders.
So you can “copy” a file (which just adds a new entry to the list I showed above):

ipfs files cp /ipfs/Qmfoo /new/path/in/mfs

Then you can export your finished work with ipfs files stat /path/in/mfs, which will give you the CID of the current state.
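Rough example (Qmfoo and Qmbar are placeholder CIDs):

# make a working folder in MFS
ipfs files mkdir /dedup

# copy files in by CID; entries with the same CID still share the same blocks
ipfs files cp /ipfs/Qmfoo /dedup/a.png
ipfs files cp /ipfs/Qmbar /dedup/d.png

# print only the CID of the folder you just built
ipfs files stat --hash /dedup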

I was thinking of that, but forgot that if you don’t include the destination name it will use the CID as the file name (I found that a bit confusing at first when using MFS). Isn’t there a way to delay updating parent CIDs when doing that? Something along the lines of, “I’m going to be adding a bunch of files, you can skip updating the parent CIDs till I’m done.” I vaguely remember reading about a feature that did that.

No, it can’t do that yet, but I’m not sure that’s an issue.

(except for the big performance issue)

You just work your way bottom-up: you first create your final files / folders, then the folders containing those folders, then the ones above that, …
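Something like this (a sketch; the paths and CIDs are placeholders):

# finish the leaf folder first
ipfs files mkdir -p /work/images
ipfs files cp /ipfs/Qmfoo /work/images/a.png
ipfs files cp /ipfs/Qmbar /work/images/b.png

# grab its CID once it is complete
ipfs files stat --hash /work/images

# then link that finished CID into the parent in one step,
# so the parent only gets rewritten once
ipfs files mkdir -p /final
ipfs files cp /ipfs/<cid-from-stat-above> /final/images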