Deduplication Ratio

Could you please tell me the deduplication ratio of IPFS?
I mean the overall deduplication ratio of the entire system.

There is no deduplication ratio of the entire system.

Sorry, I’m a little confused.
Do you mean that each file has its own deduplication ratio?

No, I mean that IPFS just doesn’t track any deduplication.
A file can be duplicated anywhere from 0 to ∞ times and IPFS just doesn’t care.

OK, got it! Thanks for your reply.


If the same file is uploaded multiple times with the same chunking algorithm, the chunks ARE fully deduplicated. That means even if a large number of people upload that same file, only one “file” will be “stored”. It’s stored chunk by chunk and the deduplication is 100%, i.e. no duplicate data is stored. (Although yes, of course, multiple different servers might hold copies, but all those copies are the same CIDs pointing to the same chunks, with no duplicates.)
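A minimal sketch of why that works (toy code, not the real IPFS API: plain SHA-256 hex digests stand in for CIDs, and the 256 KiB fixed chunk size mirrors kubo’s default chunker):

```python
import hashlib
import os

CHUNK_SIZE = 256 * 1024  # same size as kubo's default fixed-size chunker


class BlockStore:
    """Toy content-addressed store: each unique chunk is keyed by its hash."""

    def __init__(self):
        self.blocks = {}  # digest -> chunk bytes

    def put(self, chunk: bytes) -> str:
        digest = hashlib.sha256(chunk).hexdigest()
        # If this digest is already present, nothing new is written.
        self.blocks.setdefault(digest, chunk)
        return digest


def add_file(store: BlockStore, data: bytes) -> list:
    """Split data into fixed-size chunks and store each chunk by its hash."""
    return [store.put(data[i:i + CHUNK_SIZE])
            for i in range(0, len(data), CHUNK_SIZE)]


store = BlockStore()
payload = os.urandom(1024 * 1024)    # a 1 MiB "file"

first = add_file(store, payload)     # first person adds the file
second = add_file(store, payload)    # someone else adds the same file

assert first == second               # identical chunk hashes both times
print(len(store.blocks))             # 4 chunks stored, not 8: no duplicate data
```

The second add changes nothing in the store, which is the “only one file is stored” behaviour described above.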

So chunks could be deduplicated, if aligned similarly?

They’re deduplicated if they’re aligned exactly the same and use the same hashing function. The deduplication story for IPFS is a bit confusing. IPFS deduplicates at the node level but duplicates across nodes. It doesn’t so much deduplicate as store based on content, so if you go to store something a second time it just says, “nope, we’re good. Already got it.” When you go to pin something you’re duplicating content across nodes.
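A quick sketch of that node-level vs. cross-node distinction (dict-based toy stores, with a SHA-256 digest standing in for a real CID):

```python
import hashlib


def cid(block: bytes) -> str:          # toy stand-in for a real CID
    return hashlib.sha256(block).hexdigest()


node_a, node_b = {}, {}                # each node's local block store

block = b"some popular content"

# Adding the same block to node A twice: deduplicated within the node.
node_a.setdefault(cid(block), block)
node_a.setdefault(cid(block), block)
print(len(node_a))                     # 1 -- "nope, we're good. Already got it."

# Node B pins the same CID: the bytes are now duplicated across nodes,
# even though each node still holds exactly one copy.
node_b.setdefault(cid(block), block)
print(cid(block) in node_a and cid(block) in node_b)  # True
```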

The deduping with IPFS is based on hash and content, not semantics. If you add a file using two different hash functions it will be stored twice. An identical file in two different file formats will be stored twice. If you had an HTML version and a text version of a file you might get some deduplication if you get lucky and the files can be broken into chunks where the two share identical chunks of text.
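To make that concrete, a quick sketch with plain hashlib (illustrative digests only, not the actual CID computation):

```python
import hashlib

content = b"Hello, IPFS!"

# Same bytes, different hash functions -> different keys, so stored twice.
print(hashlib.sha256(content).hexdigest())
print(hashlib.blake2b(content).hexdigest())

# Same text, different file formats -> different bytes -> different keys.
as_text = b"Hello, IPFS!"
as_html = b"<html><body>Hello, IPFS!</body></html>"
print(hashlib.sha256(as_text).hexdigest() ==
      hashlib.sha256(as_html).hexdigest())   # False: no dedup between formats
```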

There’s also the added confusion of what happens when you add a file. When you add a file to IPFS you’ll have two copies of the file: the original file and the chunked, hashed copy in the IPFS block store. This can be a problem if you’re storing a large amount of data in IPFS. In that case you can use the IPFS filestore. With the filestore, IPFS stores pointers into the file system and you don’t get the two copies, but if you move the file IPFS won’t be able to find it, and there is no deduplication.
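Roughly, the filestore keeps references into the original file instead of copying its bytes into the block store. A simplified sketch of that idea (the names and layout here are illustrative, not kubo’s actual filestore format):

```python
import hashlib

CHUNK_SIZE = 256 * 1024


class FileRef:
    """A pointer to chunk bytes that still live in the original file."""

    def __init__(self, path: str, offset: int, length: int):
        self.path, self.offset, self.length = path, offset, length

    def read(self) -> bytes:
        # Fails if the original file has been moved, renamed, or rewritten.
        with open(self.path, "rb") as f:
            f.seek(self.offset)
            return f.read(self.length)


refs = {}  # digest -> FileRef (no second copy of the data is kept)


def add_nocopy(path: str) -> None:
    """Index a file by chunk hashes while leaving the bytes where they are."""
    with open(path, "rb") as f:
        offset = 0
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            refs[hashlib.sha256(chunk).hexdigest()] = FileRef(path, offset, len(chunk))
            offset += len(chunk)
```

Retrieving a block re-reads the bytes from the original path, which is why moving the file breaks the node’s copy.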

Thanks for your elaboration, this makes sense. In my opinion IPFS sometimes reinvents the wheel, while in other situations it should build better on the shoulders of giants. For example, IPFS and ZFS (and I assume BTRFS too) would be good companions, but many operations between IPFS and ZFS are duplicated; better integration for ‘heavy’ nodes could significantly reduce overhead.

For example, IPFS and ZFS (and I assume BTRFS too) would be good companions

Yes: Make IPFS reflink aware, dedup file storage between IPFS and user downloaded files · Issue #8201 · ipfs/kubo · GitHub 🙂

However, I think reinventing the wheel is a good thing sometimes. For example, in this context, if IPFS just relied on ZFS or BTRFS to dedup files, that wouldn’t work on Windows, wouldn’t work on macOS, and wouldn’t work in the browser.
Plus, those Linux filesystem dedup features only work in specific cases (such as using the kernel-accelerated copy syscalls), which few tools support.
Plus, those filesystems don’t support advanced, costly dedup techniques such as buzzhash.

I think there may be a chunker (not sure if it’s the default) that uses some kind of statistical analysis to make chunking produce a very high number of duplicate chunks (a high dedup factor), even in cases where, say, a large file has a single block of bytes added to the front of it, which would otherwise make every fixed-size chunk hash differently. With that special algorithm/chunker the data is still divided up so that dedup happens a lot.
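That sounds like content-defined chunking. As far as I know the default chunker is still fixed-size (256 KiB), but kubo has opt-in `rabin` and `buzhash` chunkers via `ipfs add --chunker=...`. A rough sketch of the idea using a simplified gear-style rolling hash (the constants and table are made up for illustration, not kubo’s implementation):

```python
import hashlib
import os
import random

random.seed(0)
GEAR = [random.getrandbits(32) for _ in range(256)]  # random value per byte
MASK = (1 << 13) - 1          # -> roughly 8 KiB average chunk size
MIN_CHUNK, MAX_CHUNK = 2048, 64 * 1024


def chunk(data: bytes) -> list:
    """Cut boundaries where a rolling hash of the most recent bytes hits a pattern."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF  # only recent bytes affect h
        length = i - start + 1
        if (length >= MIN_CHUNK and (h & MASK) == 0) or length >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def digests(chunks) -> set:
    return {hashlib.sha256(c).hexdigest() for c in chunks}


original = os.urandom(1 << 20)         # 1 MiB of data
shifted = b"NEW HEADER" + original     # same data with bytes added to the front

a, b = digests(chunk(original)), digests(chunk(shifted))
print(f"shared chunks: {len(a & b)} of {len(a)}")
# With fixed-size chunks the shift would misalign everything and share ~0 chunks;
# here boundaries depend only on nearby content, so most chunks still match.
```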

UPDATE: I found the old discussion where I learned about this: