What would be the best way to check how much deduping is going on after having added a directory to IPFS?
What do you mean by deduping?
How much smaller the storage is in IPFS because of duplicate data in the source where the data was added from. I’m adding a directory with 60 GB of data and am interested in how much of it hashed to the same blocks.
By default IPFS uses the simplest possible chunker (fixed-size blocks) and doesn’t actively try to dedup. Dedup can still happen, but mostly by luck (it only works reliably for files whose identical content happens to be aligned the same way).
If you want to actively dedup you can try the buzhash chunker while adding (but note that buzhash only dedups on a per-file basis, not globally):
ipfs add --chunker=buzhash example.txt
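For a directory like your 60 GB one, the same idea applies with a recursive add (the directory name here is just a placeholder):

ipfs add -r --chunker=buzhash ./mydir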
Then, to see how much deduping was actually achieved, you can run:
ipfs dag stat QmExample
That gives you the size in IPFS (counting duplicated blocks only once). Compare that with the size on your filesystem and you’ll know the difference. (Note that you might see a slightly bigger IPFS size for files where the chunker found no repeating parts; that’s normal, because the directory and file root objects are bigger than their filesystem equivalents (inodes).)
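For example, a rough way to make that comparison (assuming GNU du; the directory name and CID are placeholders for your own):

du -sb ./mydir                 # total size on disk, in bytes
ipfs dag stat QmYourRootCID    # size in IPFS, counting each duplicated block only once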
Thanks. That’s just what I was looking for.
I was thinking that the deduplication story for IPFS is a bit suboptimal. I know how it works, but I’m just thinking out loud here. It’s unfortunate that you basically have to run the add twice with different chunking settings just to tell whether a more costly chunker would be beneficial. But if you’re going to run buzhash anyway just to see whether it helps, you’ve already paid the computational cost, so is there any reason to keep using the default chunker after that?
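As a concrete sketch of that run-it-twice comparison (the directory name is a placeholder; -Q prints only the final root CID):

ipfs dag stat "$(ipfs add -r -Q ./mydir)"                    # default fixed-size chunker
ipfs dag stat "$(ipfs add -r -Q --chunker=buzhash ./mydir)"  # content-defined buzhash chunker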
It might be nice if the default chunker could do some sort of sampling, running some random fraction of the data through the buzhash chunker and then reporting at the end something like: “Sampled 3% of input with buzhash and found 10% deduplication. Your data may be a good candidate for buzhash chunking.”
I was also thinking about a combined buzhash and default chunker: partition on whatever boundaries buzhash returns, and also keep boundaries at the regular intervals the default chunker uses. Sure, it would produce some small random chunks, and you couldn’t deduplicate across the blocks offered on the network, but there’s no reason not to deduplicate them on the individual node. Right now, if someone adds a file with the default chunker and someone else adds it with buzhash, any deduplication goes completely out the window and I’m guessing you’d end up with two copies. With a combined buzhash/default chunker you could conceivably store the default-sized chunks as references to the smaller chunks.
I still have a vague idea that something clever can be done with lthash to help the situation.