Could you please tell me the deduplication ratio of IPFS?
I mean the overall deduplication ratio of the entire system.
There is no deduplication ratio for the entire system.
Sorry, I'm a little confused.
Do you mean that each file has its own deduplication ratio?
No, I mean that IPFS just doesn't track any deduplication.
A file can be duplicated from 0 to ∞ times and IPFS just doesn't care.
OK, got it! Thanks for your reply.
If the same file is uploaded multiple times with the same chunking algorithm the chunks ARE fully de-duplicated. That means even if a large number of people upload that same file, only one "file" will be "stored". It's stored chunk by chunk and the deduplication is 100%, i.e. no duplicate data is stored. (Although yes, of course, multiple different servers might hold copies perhaps… but all those copies are the same CIDs pointing to the same chunks, with no duplicates.)
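To make that concrete, here's a toy Python sketch of the idea. It is not IPFS's real datastore code; the 256 KiB figure is kubo's default fixed chunk size, but everything else (names, the plain SHA-256 "CID") is invented for illustration:

```python
import hashlib
import os

CHUNK_SIZE = 256 * 1024  # kubo's default fixed-size chunker uses 256 KiB blocks

def add_file(blockstore: dict, data: bytes) -> list[str]:
    """Chunk the data, hash each chunk, and store it under its hash.
    A chunk that is already present is simply not written again."""
    cids = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        cid = hashlib.sha256(chunk).hexdigest()  # stand-in for a real CID
        blockstore.setdefault(cid, chunk)        # no-op if already stored
        cids.append(cid)
    return cids

blockstore = {}
file_bytes = os.urandom(1024 * 1024)   # a 1 MiB "file"

add_file(blockstore, file_bytes)       # first upload: 4 chunks stored
add_file(blockstore, file_bytes)       # second upload: nothing new is written
print(len(blockstore))                 # 4 -- the file's data exists only once
```

However many times the same bytes are added with the same chunker, the blockstore ends up holding each chunk exactly once.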
So chunks could be deduplicated if they're aligned similarly?
They're deduplicated if they're aligned exactly the same and use the same hashing function. The deduping story for IPFS is a bit confusing. IPFS deduplicates at the node level but duplicates across nodes. It doesn't so much deduplicate as store based on content, so if you go to store something a second time it just says, "nope, we're good, already got it". When you go to pin something you're duplicating content across nodes.
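A minimal sketch of that "already got it" behaviour, and of the node-level/cross-node distinction (the `Node` class and everything in it is made up for illustration, not how kubo actually pins):

```python
import hashlib

class Node:
    """Toy content-addressed blockstore for a single IPFS node."""
    def __init__(self):
        self.blocks = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        if key not in self.blocks:        # "nope, we're good, already got it"
            self.blocks[key] = data
        return key

node_a, node_b = Node(), Node()

cid = node_a.put(b"some content")
node_a.put(b"some content")   # adding it again on the same node writes nothing new

# "Pinning" on another node copies the block there: deduplicated within
# each node, duplicated across nodes.
node_b.put(b"some content")

print(len(node_a.blocks), len(node_b.blocks))   # 1 1
```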
The deduping in IPFS is based on hash and content, not semantics. If you add a file using two different hash functions it will be stored twice. An identical file in two different file formats will be stored twice. If you had an HTML version and a text version of a file you might get some deduplication if you're lucky and the files can be broken into chunks where the two share identical runs of text.
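A quick way to see the "hash and content, not semantics" point, using plain hashlib digests as stand-ins for CIDs (I believe kubo lets you pick the hash with `ipfs add --hash=...`, but treat that flag name as from memory):

```python
import hashlib

data = b"exactly the same bytes"

# Same content, two different hash functions -> two different digests,
# so the block would be addressed (and stored) under two different CIDs.
print(hashlib.sha256(data).hexdigest())
print(hashlib.blake2b(data, digest_size=32).hexdigest())

# The "same" document in two formats is simply different bytes:
html = b"<p>hello world</p>"
text = b"hello world"
print(hashlib.sha256(html).hexdigest() == hashlib.sha256(text).hexdigest())  # False
```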
There's also the added confusion of what happens when you add a file. When you add a file to IPFS you'll have two copies of the file: the original file and the chunked, hashed copy in the IPFS datastore. This can be a problem if you're storing a large amount of data in IPFS. In that case you can use the IPFS filestore. With the filestore, IPFS stores pointers into the file system and you don't get the two copies, but if you move the file IPFS won't be able to find it, and there is no deduplication.
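Conceptually, the filestore (used via `ipfs add --nocopy` with the filestore experiment enabled) keeps references instead of copies, roughly like this hypothetical sketch (class and method names invented, not kubo's implementation):

```python
import hashlib

class Filestore:
    """Toy version of the filestore idea: keep (path, offset, length)
    references to the original file instead of copying chunk bytes
    into the datastore."""
    def __init__(self):
        self.refs = {}   # cid -> (path, offset, length)

    def add_chunk(self, path: str, offset: int, length: int) -> str:
        with open(path, "rb") as f:
            f.seek(offset)
            chunk = f.read(length)
        cid = hashlib.sha256(chunk).hexdigest()
        self.refs[cid] = (path, offset, length)   # pointer only, no second copy
        return cid

    def get(self, cid: str) -> bytes:
        path, offset, length = self.refs[cid]
        # If the original file has been moved or rewritten, this read
        # fails or returns bytes that no longer match the CID.
        with open(path, "rb") as f:
            f.seek(offset)
            return f.read(length)
```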
Thanks for your elaboration, this makes sense. In my opinion IPFS sometimes reinvents the wheel, while in other situations it should build better on the shoulders of giants. For example, IPFS and ZFS (and I assume BTRFS too) would be good companions, but many operations between IPFS and ZFS are duplicated; better integration for "heavy" nodes could significantly reduce overhead.
However, I think reinventing the wheel is sometimes a good thing. In this context, for example, if IPFS just relied on ZFS or BTRFS to dedup files, that wouldn't work on Windows, wouldn't work on macOS, and wouldn't work in the browser.
Plus, those Linux FS dedup features only work in specific cases (such as using the kernel-accelerated copy syscalls), which few tools support.
Plus, those filesystems don't support more advanced, costly dedup techniques such as buzzhash.
I think there may be a chunker (not sure if it's the default) that uses some kind of statistical analysis so that chunking produces a very high number of duplicate chunks (a high dedup factor), even in cases where, say, a large file has a single block of bytes added to the front of it, which would otherwise make every chunk hash differently. With that special algo/chunker the data still gets divided up so that a lot of dedup happens (a toy sketch of the idea is below).
UPDATE: I found the old discussion where I learned about this:
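For reference: kubo's default is the fixed-size 256 KiB chunker, and the kind of chunker described above is a content-defined one, exposed via `ipfs add --chunker=rabin` or `--chunker=buzhash`. Here is a toy Python sketch of why content-defined chunking keeps deduplicating after an insertion at the front of a file. It is not from that discussion and not IPFS's implementation; the window size, mask, and minimum chunk size are all invented for the demo:

```python
import hashlib
import os

WINDOW = 16            # bytes of trailing context used to decide a cut
MASK = (1 << 12) - 1   # cut when the window hash ends in 12 zero bits (~4 KiB avg chunks)
MIN_CHUNK = 512        # avoid pathologically small chunks

def content_defined_chunks(data: bytes) -> list[bytes]:
    """Cut wherever the hash of the trailing window matches a pattern, so
    boundaries depend on the bytes themselves rather than on absolute
    offsets. (Real chunkers like buzhash/rabin use an efficient rolling
    hash; hashing each window with SHA-256 keeps this sketch short, not fast.)"""
    chunks, start = [], 0
    for i in range(WINDOW, len(data)):
        if i - start < MIN_CHUNK:
            continue
        window_hash = int.from_bytes(hashlib.sha256(data[i - WINDOW:i]).digest()[:4], "big")
        if (window_hash & MASK) == 0:
            chunks.append(data[start:i])
            start = i
    chunks.append(data[start:])
    return chunks

def chunk_cids(data: bytes) -> set[str]:
    return {hashlib.sha256(c).hexdigest() for c in content_defined_chunks(data)}

original = os.urandom(256 * 1024)
shifted = b"a few bytes added to the front" + original  # would break every fixed-size chunk

a, b = chunk_cids(original), chunk_cids(shifted)
print(f"{len(a & b)} of {len(a)} chunks are still shared after the insertion")
```

With a fixed-size chunker, prepending even one byte shifts every chunk boundary and nothing dedups; here only the chunk(s) around the insertion point change, because each boundary is decided by the local bytes themselves.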