Disk space consumption in IPFS

I’ll add to the messy brain dump. I find deduplication in IPFS difficult to work with beyond the very straightforward case: add the same file (with the same hasher and chunker) and you get the same blocks. With the standard chunker you could get some deduplication, although it’s not that likely. It’s also difficult to tell how much deduplication has occurred after adding content. If you want better deduplication from a different chunker, you basically have to add the content a second time and compare just to see whether it helped.
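To make the "add it and compare" idea concrete, here is a rough sketch that simulates it outside of IPFS. It assumes a plain fixed-size split and raw SHA-256 digests as stand-ins for the real chunker and CIDs; `fixed_chunks` and `dedup_stats` are names I made up for illustration, not anything from the IPFS tooling.

```python
import hashlib

def fixed_chunks(data: bytes, size: int = 262144):
    """Split data into fixed-size chunks (a stand-in for a size-based chunker)."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def dedup_stats(chunks):
    """Return (total_bytes, unique_bytes), treating equal SHA-256 digests as one block."""
    unique_sizes = {}
    for chunk in chunks:
        # Only the first occurrence of each digest contributes to stored bytes.
        unique_sizes.setdefault(hashlib.sha256(chunk).digest(), len(chunk))
    total = sum(len(chunk) for chunk in chunks)
    return total, sum(unique_sizes.values())

# Four identical 1 KiB chunks of "A" followed by one 1 KiB chunk of "B".
data = b"A" * 4096 + b"B" * 1024
total, unique = dedup_stats(fixed_chunks(data, 1024))
print(total, unique)  # 5120 2048
```

Running this with a few different chunk sizes gives a cheap way to compare settings, although it only tells you about duplication inside that one piece of content.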

I’ve also found there are almost two notions of deduplication, maybe even three. First, you can have deduplication within the content that’s being added. I make the distinction here because at this point you control the hasher/chunker: you can always add the content again with different ones. Then there’s deduplication against data that might already be on your node. I make this distinction because you can’t really tell apart deduplication against data that was already on your node, possibly from other content, and deduplication that occurred within the content that was just added. The third one would be deduplication that can take place because the data is already available on the network. It’s a weird one because it’s like duplicate deduplicated data. It’s on the network multiple times but has a single reference, the CID.
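A toy way to separate the first two kinds is to check each block against what the node already holds before checking it against the rest of the same add. This is only a sketch: `classify_blocks` is a hypothetical helper, the set of existing hashes stands in for the node’s blockstore, and raw SHA-256 digests stand in for CIDs. A block that is both already stored and repeated within the add is counted as already stored.

```python
import hashlib

def classify_blocks(chunks, existing_hashes):
    """Count blocks that are new, repeated within this add, or already on the node."""
    counts = {"new": 0, "internal_duplicate": 0, "already_stored": 0}
    seen_this_add = set()
    for chunk in chunks:
        digest = hashlib.sha256(chunk).digest()
        if digest in existing_hashes:
            counts["already_stored"] += 1      # second kind: node had it before
        elif digest in seen_this_add:
            counts["internal_duplicate"] += 1  # first kind: repeated in this content
        else:
            seen_this_add.add(digest)
            counts["new"] += 1
    return counts

# Pretend the node already stores a block with content b"a".
existing = {hashlib.sha256(b"a").digest()}
result = classify_blocks([b"a", b"b", b"b", b"c"], existing)
print(result)  # {'new': 2, 'internal_duplicate': 1, 'already_stored': 1}
```

The catch is that you need a snapshot of what was on the node before the add, which is exactly the information that is hard to recover afterwards.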

I kind of wonder what the value of a content-based chunker is if the same content can so easily be added a second time (say, with a different chunker), resulting in two completely separate copies of it. I can measure the first type of deduplication (the internal type), but I can’t really know what deduplication may occur when more content is added later.
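The "two separate copies" effect is easy to reproduce in a toy model. The sketch below assumes fixed-size splitting and raw SHA-256 digests in place of real chunkers and CIDs, and `block_hashes` is a made-up helper. The same bytes are split with two different chunk sizes and the resulting block sets are compared.

```python
import hashlib

def block_hashes(data: bytes, size: int):
    """Return the set of SHA-256 digests of the fixed-size chunks of data."""
    return {hashlib.sha256(data[i:i + size]).digest()
            for i in range(0, len(data), size)}

# 10240 bytes of deterministic, non-repeating filler data.
data = b"".join(hashlib.sha256(i.to_bytes(4, "big")).digest() for i in range(320))

blocks_a = block_hashes(data, 1024)  # first add
blocks_b = block_hashes(data, 1000)  # same content, different chunk size
shared = blocks_a & blocks_b
print(len(shared))  # 0
```

No chunk from one split has the same length as any chunk from the other, so nothing is shared and a store keyed by block hash would hold the content twice.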

I’ve also been wondering if deduplication in the datastore is always a good idea. Sure, there is the concept of deduplication across IPFS by CID, but that doesn’t mean the data needs to be deduplicated in the datastore. I can think of some cases where you might say, “I’ve got plenty of storage that I’d be happy to trade for better access times.”