(draft) Common Bytes - standard for data deduplication

I’ve been thinking of posting a summary of everything related to this. I’m typing on my phone with clumsy big old fingers so please excuse typos. Correction; thank the lord for discuss.ipfs.io’s magic poster-aware auto-draft saving, I was able to resume editing on my laptop.

IPFS currently supports chunk/block based de-duplication with several different chunkers. This has the following features/limitations (numbered for easy reference in followup discussions);

  1. The same file added with different chunkers or chunker parameters will not de-duplicate at all. This could be solved to some degree if merkle-dag entries were indexed by the full hash of the content they pointed to instead of the hash of the merkle-dag node. However, this would be a very invasive change likely to introduce more problems than it solves.

  2. Chunk de-duplication can only de-duplicate identical data as big as the chunks. This means smaller chunks can do more de-duplication by finding smaller duplications, but at some point the per-block overheads will override the de-duplication benefits. It would be good to know what the minimum practical block size is, but it’s definitely larger than a single line of text. Note that this means something as simple as converting a text file from \r\n DOS lines to Unix \n lines will break block de-duping.

  3. The default chunker is fixed 256K blocks. This will only de-duplicate differences within files that exactly align with the 256k block boundaries. This pretty much limits it to things like large database files with fixed record sizes or files that are only appended to.

  4. The rabin chunker should be able to produce more de-duplicate-able blocks at arbitrary offsets provided the same parameters are used. The default rabin chunker parameters are min/avg/max 128k/256k/378k, which is very large and thus won’t find any duplication smaller than that. I would guess 16k/32k/64k would be better, depending on per-block overheads.

  5. Content aware chunkers could do even better at deciding on chunk boundaries than rabin, but at a high complexity cost (content-specific chunkers for each different file type will need to be written/maintained) for probably marginal benefit over rabin with decent parameters. Note that they will still not be able to address 2 and could make 1 even worse (yet more different ways of chunking). It also can’t solve problems like 6 below where the content doesn’t actually have any duplicated chunks.

  6. Compressed files will typically have no duplication for even tiny changes in the compressed content. This can be solved to some extent using things like gzip --rsyncable which minimizes changes in the compressed output for a small compression cost, but that needs to be done when the file is compressed before it is added to IPFS.

  7. Identical content compressed with different compressors or compression arguments will have no de-duplication. Sometimes different compressor versions or even just compression dates can produce different output due to format changes or included timestamps.

  8. The decompressed content of originally identical files can be different for lossy compression like mpeg, jpeg, mp3 etc. This can be addressed for things like music storage by storing only the highest quality versions, fuzzy matching content to de-duplicate identical songs on upload, and transcoding down to the desired quality/format on demand for downloads. Does this sound easy to you?

  9. Content-aware transcoding could be used to automatically convert formats with poor block-de-duplication into different formats with good block-de-duplication, eg files could be uncompressed, tarballs expanded into directories, etc. A top level node could include details necessary to convert the content back to the originally uploaded format, so the same content uploaded in different formats would fully de-duplicate all the content, with only different top-level “format” nodes. Alternatively canonical formats could be always be used for contents, and an API provided to request the content transcoded into different formats… eg a directory could be fetched as a tar-file, or tar.gz file. Recursion could handle arbitrarily nested formats. Again, the complexity cost of this is high, with lots of format-specific transcoding support, but the return might be high for some file formats.

  10. Compression is fine-grained efficient smaller-than-block-size de-duplication to get past 2. Note using 9 to uncompress content to get better block-de-duplication removes the compression-de-duplication, which might be worse overall. Normally compression only finds duplication within a file, but using a shared compression dictionary is how you can de-duplicate across files. Some advanced compressors have built in dictionaries for different content-types, but in the most general case things like zstd have rich support for building and re-using custom compression dictionaries. This could be used to do something like add compressed block support, where the merkle-dag could indicate that the blocks have been compressed using a dictionary stored in another file/blocks. It’s better if dictionaries are widely re-used for many files, and exactly how best to build/share/use dictionaries for different content is a broad and complex topic. You could end up making things worse, forcing people to download large dictionaries to uncompress small files.

  11. Compression of blocks can be already implemented on nodes by using a compressing filesystem under IPFS. However, this can only do generic compression and cannot take advantage of content-type information for better compression. It also doesn’t help network transfer of blocks, which is probably more important. Network transfer can benefit from on-the-fly network compression, but it’s pretty ugly when combined with filesystem compression; you uncompress it to read it, only to compress it again to send it, and you do this every time you send it. Implementing block compression within IPFS may be nice to get decent storage, network, and cpu wins. If I was to add this I’d make it another option when adding the file, similar to how you can select the chunker, so that users could make content-type based decisions on the best compression options/dictionaries/etc.

  12. Different file formats and compression are effectively content encoding, and sometimes the particular encoding used matters. Sometimes you really don’t want the file messed with by magic transcoding etc. For example, the entire point of an *.mkv file of a movie might be the finely tuned encoding parameters to give the best compromise between file size and quality. A *.tar.gz file might be an important software snapshot that has been check-summed and signed; see https://github.com/librsync/librsync/issues/146 for an example where github’s re-generating of release archives broke this for many people.

Note that almost all of the ideas above to give better de-duplication don’t have to be built into IPFS at all. They could be implemented by adding a content-type aware layer on top of IPFS to do things like transcode to/from cannonical formats with good block-de-duplication, and select the optimal chunking settings to use. Given the complexity of many of these, keeping them OUT of IPFS and in their own optional layer where people can experiment with and tune them without breaking IPFS itself is a good idea.

IMHO the most important (and easy) things to focus on within IPFS would be better generic chunkers and default parameters (see 1->4 above) and maybe (harder) block compression (see 11).