IPFS and file deduplication

Hey interplanetarians! I got sent here by some folks from the dat-project, who thought IPFS might be better suited for the use case I have in mind.

I’ve got a 1TB hard drive that I’d like to clean up and make useful. There are two problems with the way the data on that drive is structured. For one, when backing up my data in a hurry because a computer was failing, I ended up backing up backups, so there’s a degree of nesting where a bunch of the data is duplicated at more than one level. The other problem is that, back when I didn’t understand that deeply nested directories are a poor way of labeling data, I used them to divide up notes, PDFs, etc., by subject area. I was sort of using the directory tree as a poor tagging system.

What I would like to do is write a program that examines every file on the drive. If a file’s content is novel to the program, I want that content stored under some canonical path for later retrieval, and I want to save the string representing its original path on the 1TB drive in a database. If a file is NOT novel, I still want to save that path string, but associate it with the existing record.
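
Roughly the kind of thing I have in mind, sketched in Python (the paths, table layout, and choice of SHA-256 are just placeholders for whatever the real tool would use):

```python
import hashlib
import os
import shutil
import sqlite3

DRIVE_ROOT = "/mnt/1tb"        # placeholder: the messy drive
STORE_ROOT = "/mnt/canonical"  # placeholder: content-addressed store

db = sqlite3.connect("index.db")
db.execute("""CREATE TABLE IF NOT EXISTS files (
                  hash TEXT,          -- content hash (the "record")
                  original_path TEXT  -- where this copy lived on the drive
              )""")

def file_hash(path):
    """Hash a file's contents in chunks so large files don't exhaust memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

for dirpath, _dirnames, filenames in os.walk(DRIVE_ROOT):
    for name in filenames:
        src = os.path.join(dirpath, name)
        digest = file_hash(src)
        canonical = os.path.join(STORE_ROOT, digest[:2], digest)
        if not os.path.exists(canonical):
            # novel content: copy it to its canonical location exactly once
            os.makedirs(os.path.dirname(canonical), exist_ok=True)
            shutil.copy2(src, canonical)
        # novel or not, remember where this copy was found
        db.execute("INSERT INTO files VALUES (?, ?)", (digest, src))

db.commit()
```

Duplicates would then just be extra rows in the database pointing at the same hash.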

I learned about content-addressable hashing a couple of years ago and thought it might be the key to solving problems like this. But I’m also betting that I don’t have to implement rolling checksums myself, so I came here. Can someone point me toward any tools, libraries, or systems that could help with this project?

This is essentially what ipfs add -r /path/to/root/of/directory/tree already does. The result will be a top-level hash for the directory tree. Duplicate files will only be stored once on disk in the IPFS repository, and everything will be addressed by an immutable hash.

Caveats:

  • this will copy all of your unique data into the IPFS repository by default (potentially duplicating disk space usage)
  • IPFS’ defaults aren’t suited for private data, so you’d need to make adjustments depending on your use case

The resulting root-level hash can then be mounted read-only as an immutable directory. Or it can be added to IPFS’ mutable filesystem (mfs) – though I don’t think there is currently a way to mount the mfs and interact with it as if it were a regular filesystem.
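
If you want to tie this back to the path-to-hash bookkeeping from your original post, here’s a rough sketch that just shells out to the ipfs CLI and records which hash each path ended up under. It assumes the usual “added <hash> <path>” output of ipfs add -r; double-check the format against your version:

```python
import subprocess

# Sketch: assumes go-ipfs is installed and the repo is initialized.
out = subprocess.run(
    ["ipfs", "add", "-r", "/path/to/root/of/directory/tree"],
    capture_output=True, text=True, check=True,
).stdout

mapping = {}
for line in out.splitlines():
    # each line looks like: added <hash> <relative path>
    _, cid, path = line.split(maxsplit=2)
    mapping[path] = cid

# the last "added" line is the directory tree itself, i.e. the root hash
root_hash = list(mapping.values())[-1]
print("root:", root_hash)
```

The mapping dict (path to hash) is essentially the database table you described, and the root hash is the single handle for the whole tree.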

For a totally unrelated-to-IPFS answer: if all you want is to stop storing the same file twice, you could use something like the hardlink tool on Linux to identify files by content hash and replace duplicates with hardlinks. Or, if you want to identify the duplicate files so you can clean them up manually (or script it), you could use something like fdupes to find duplicates based on file hash and optionally delete the extra copies.
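
If you’d rather script the hardlink approach yourself instead of using those tools, the core of it is only a few lines. A rough sketch (it assumes everything is on one filesystem, since hardlinks can’t cross filesystem boundaries, and that replacing duplicates in place is acceptable):

```python
import hashlib
import os

ROOT = "/mnt/1tb"  # placeholder

def file_hash(path):
    """Hash a file's contents in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

first_seen = {}  # content hash -> first path found with that content

for dirpath, _dirnames, filenames in os.walk(ROOT):
    for name in filenames:
        path = os.path.join(dirpath, name)
        digest = file_hash(path)
        if digest in first_seen:
            # replace this duplicate with a hardlink to the first copy
            os.remove(path)
            os.link(first_seen[digest], path)
        else:
            first_seen[digest] = path
```

After that, every duplicate path still exists and opens the same bytes, but the content is only stored once.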

You might be better off with a deduplicating backup application, for example borgbackup, restic, or zpaq (latest version).

Take a look at ipfs help pin update on the command line; the pin update command may also be relevant to your task.