IPFS and deduplication

Someone asked me a question about using IPFS for deduplication that had me scratching my head a little bit. Deduplication gets mentioned a lot when discussing IPFS, and I understand how it works. Their question was along the lines of, “I have two sets of images and I want to create a set with the duplicates removed.” Sure, you could just say, “Just add it to IPFS and it will be deduplicated,” but it’s the storage that will be deduplicated; there will still be multiple references. I think what they’re asking for is deduplicated references, and while that wouldn’t be terribly difficult, I couldn’t think of any simple way to do it.

Are there any easy ways to do this? Seems like ipfs object diff might be a possibility.
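(If ipfs object diff does work for this, I’d guess the usage is roughly the following, with placeholder folder CIDs; it should list the links that differ between the two folders.)

ipfs object diff /ipfs/QmFolderA /ipfs/QmFolderB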

It happens automatically when two blocks have the same hash. You rarely have to do anything.

So the first time it sees the image, it hashes the blocks, doesn’t find them in the blockstore, and copies them in.
The second time, it hashes the blocks, sees they are already stored, and does nothing.

Note this only works at the block level: files first need to be chunked into blocks, and chunking has different parameters. For example, if you add the same image twice but with different block sizes (one at 256 KiB, the other at 1 MiB), the copies won’t be deduplicated; there are options in the chunking process that control this.
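You can see the effect without writing anything to the blockstore by hashing the same file with different chunkers. A rough sketch (image.png is just a placeholder, and the file needs to be bigger than a single chunk for the difference to show):

# Same file, same chunker: same CID both times, so the blocks would dedupe.
ipfs add --only-hash --chunker=size-262144 image.png
ipfs add --only-hash --chunker=size-262144 image.png

# Same file, 1 MiB chunks: different block boundaries, so a different CID
# and no shared blocks with the 256 KiB version.
ipfs add --only-hash --chunker=size-1048576 image.png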

That’s what I meant by “under the hood”, but say someone comes along and says, “I put together 5 separate image datasets, but I’m pretty sure there are a lot of duplicate images. Same images with different names. I want to combine them into a single image dataset without the duplicates. I heard IPFS does deduplication, and I plan on putting the data in there anyway. Can I use IPFS to do the deduplication?”

When I say without the duplicates, I mean a list or directory containing all the unique files from the image sets.

Sure, I could throw together a quick bash script to do it, but I couldn’t think of a great way to do it in IPFS itself.
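Something like this is what I had in mind (a rough, untested sketch; QmFolderA and QmFolderB are placeholder folder CIDs, and it assumes file names without spaces):

# List both folders (ipfs ls prints CID, size, name per entry),
# keep the first name seen for each CID, and print the unique pairs.
{
  ipfs ls QmFolderA
  ipfs ls QmFolderB
} | awk '!seen[$1]++ {print $1, $3}'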

Names are not part of files; they are part of folders.
Files are just the data, and folders are just a list of names + child CIDs.
Kinda like this:

[
  {"name":"a.png","hash": "Qmfoo"},
  {"name":"b.png","hash": "Qmfoo"},
  {"name":"c.png","hash": "Qmfoo"}
]

For example, this is roughly what the inside of a folder looks like (in reality there is a bit more metadata and it’s protobuf, not JSON). It has 3 files, a.png, b.png and c.png, and since they all have the same CID, the data is deduplicated.

Got it, but there can be multiple files with different names and the same CID. What I’m asking is how the user would create a directory or list of all the unique CIDs. I.e. I’ve got everything in IPFS, some in folder A and some in folder B. Now I want to make a folder C with all the unique CIDs. I know plenty of ways to do it, but they all seem a bit hackish, and I was wondering if there was some elegant way to do it in IPFS.

oh you want an elegant way :smiley:

Try MFS

ipfs files --help

This is a virtual space where you can “edit” folders.
So you can “copy” a file (which just adds a new entry to the list I showed above):

ipfs files cp /ipfs/Qmfoo /new/path/in/mfs

Then you can export your finished work with ipfs files stat /path/in/mfs, which will give you the CID of the current state.
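Rough example (Qmfoo and Qmbar are placeholder CIDs):

# make a working folder in MFS
ipfs files mkdir /dedup

# copy files in by CID; entries with the same CID still share the same blocks
ipfs files cp /ipfs/Qmfoo /dedup/a.png
ipfs files cp /ipfs/Qmbar /dedup/d.png

# print only the CID of the folder you just built
ipfs files stat --hash /dedup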

I was thinking of that, but forgot that if you don’t include the destination name it will use the CID as the file name (I found that a bit confusing at first when using MFS). Isn’t there a way to delay updating parent CIDs when doing that? Something along the lines of, “I’m going to be adding a bunch of files, you can skip updating the parent CIDs till I’m done.” I vaguely remember reading about a feature that did that.

No, it can’t do that yet, but I’m not sure that’s an issue.

(except for the big performance issue)

You just work your way bottom-up: you first create your final files / folders, then the folders containing those folders, then the ones above that, …
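Something like this (a sketch; the paths and CIDs are placeholders):

# finish the leaf folder first
ipfs files mkdir -p /work/images
ipfs files cp /ipfs/Qmfoo /work/images/a.png
ipfs files cp /ipfs/Qmbar /work/images/b.png

# grab its CID once it is complete
ipfs files stat --hash /work/images

# then link that finished CID into the parent in one step,
# so the parent only gets rewritten once
ipfs files mkdir -p /final
ipfs files cp /ipfs/<cid-from-stat-above> /final/images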