IPFS and file deduplication

mathpunk · January 13, 2019, 10:46pm

Hey interplanetarians! I got sent here by some folks from the dat-project, who thought IPFS might be better suited to the use case I have in mind.

I’ve got a 1TB hard drive that I’d like to clean up and make useful. There are two problems with the way the data on that drive is structured. First, when backing up my data in a hurry because a computer was failing, I ended up backing up backups, i.e. there is a degree of nesting in there where a bunch of data is duplicated at more than one level. Second, back when I didn’t understand that deeply nested directories were a poor way of labeling data, I used them to try to divide up notes, PDFs, etc., by subject area. I was sort of using the directory system as a poor tagging system.

What I would like to do is write a program that examines files on the drive. If a file is novel to the program, I want its contents to get some canonical path for later retrieval, and I want to save the string representing its path on the TB drive in a database. If a file is NOT novel, I still want to save that path string, but associate it to the existing record.
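For illustration, the program described above could be sketched in Python roughly as follows. This is only a minimal sketch under some assumptions that are not in the original post: SHA-256 is used as the content hash, SQLite is the database, the first path at which a given content is seen stands in for its "canonical path", and the names open_index, index_tree, contents and sightings are all made up for the example.

```python
import hashlib
import os
import sqlite3


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def open_index(db_path):
    """Open (or create) the index database. Table names are hypothetical."""
    conn = sqlite3.connect(db_path)
    # One row per distinct file content.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS contents ("
        "digest TEXT PRIMARY KEY, canonical_path TEXT NOT NULL)"
    )
    # One row per path on the drive, pointing at the content found there.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sightings ("
        "path TEXT PRIMARY KEY, digest TEXT NOT NULL)"
    )
    return conn


def index_tree(conn, root):
    """Walk root, record every file path, and return how many files were novel."""
    novel = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            digest = file_digest(path)
            known = conn.execute(
                "SELECT 1 FROM contents WHERE digest = ?", (digest,)
            ).fetchone()
            if known is None:
                # Novel content: assumption is that the first path seen
                # serves as the canonical location.
                conn.execute(
                    "INSERT INTO contents (digest, canonical_path) VALUES (?, ?)",
                    (digest, path),
                )
                novel += 1
            # Novel or not, remember that this path held this content.
            conn.execute(
                "INSERT OR REPLACE INTO sightings (path, digest) VALUES (?, ?)",
                (path, digest),
            )
    conn.commit()
    return novel
```

Calling open_index with a database file kept outside the drive and then index_tree with the drive's mount point would fill both tables. Every original path survives in sightings, so the old directory names could later be mined as tags.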

I learned about content-addressable hashing a couple of years ago, and thought it might be the key to solving problems like this. But I also bet that I didn’t have to implement rolling checksums myself, and so I came here. Can someone help point me in the direction of any tools, libraries, or systems that could help me with this project?

leerspace · January 15, 2019, 1:36pm

This is essentially what ipfs add -r /path/to/root/of/directory/tree already does. The result will be a top-level hash for the directory tree. Duplicate files will only be stored once on-disk in the IPFS repository and everything will be addressed by an immutable hash.

Caveats:

- This will copy all of your unique data into the IPFS repository by default (potentially duplicating disk space usage).
- IPFS’ defaults aren’t suited for private data, so you’d need to make adjustments depending on your use case.

The resulting root-level hash can then be mounted read-only as an immutable directory. Or it can be added to IPFS’ mutable filesystem (MFS) – though I don’t think there is currently a way to mount MFS and interact with it as if it were a regular filesystem.

For a totally unrelated-to-IPFS answer, if all you want to do is stop storing duplicate files twice, you could use something like the hardlink tool on Linux to identify files based on file hash and replace duplicate files with hardlinks. Or if you want to identify the duplicate files so you can clean them up manually (or script it), then you could use something like fdupes to identify the duplicates based on file hash and optionally delete duplicate copies.
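If you go the "script it" route, the general idea of hash-based duplicate detection can be sketched in a few lines of Python. To be clear, this is not how hardlink or fdupes are actually implemented; it is a rough illustration that assumes SHA-256 as the hash and groups files by size first so that files with a unique size never need to be read. The function names are made up for the example.

```python
import hashlib
import os
from collections import defaultdict


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Return a list of groups; each group is a sorted list of paths
    whose files have the same size and the same SHA-256 digest."""
    # Pass 1: bucket regular files by size (cheap, no file reads).
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            by_size[os.path.getsize(path)].append(path)

    # Pass 2: hash only files that share their size with another file.
    by_digest = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue
        for path in paths:
            by_digest[(size, sha256_of(path))].append(path)

    return [sorted(paths) for paths in by_digest.values() if len(paths) > 1]
```

This only reports the groups and never deletes or links anything, so the cleanup decision stays with you.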

qw09xu · September 27, 2019, 5:35am

You might be better off with a deduplicating backup application, for example borgbackup, restic, or zpaq (latest version)?

lyrx · September 27, 2019, 6:40am

Try ipfs help pin update on the command line. The pin update command may also be related to your task.
