Pinning data without duplication?

I have a large 1TB dataset that I’d like to pin, but I would also like to be able to use and modify that data. From what I understand, when you pin data it makes a copy of it wherever your ipfs data directory is, in some form that facilitates CID lookup. I’m also aware that there is an experimental FUSE feature that lets you view pinned IPFS data as if it were on your local filesystem.

I’m wondering if there is (or ever will be) a way to pin data with some sort of copy-on-write mechanism? The use-case is, I have a dataset in a state I want to share, and I pin it with:

ipfs add --pin -r /path/to/data --progress

Ideally if the data and ipfs directory were in the same filesystem and the filesystem supported copy-on-write (e.g. btrfs / zfs), ipfs would be able to create references to all of the pinned files to prevent data duplication (or I suppose even hard-links might work for older filesystems). Then if the user modified something in /path/to/data, the copy-on-write mechanism preserves the original state so IPFS doesn’t lose references to the information it needs to provide the CID produced at pin-time. Effectively this provides the ability to not only snapshot directories or files, but to share them over the IPFS network without using double the disk space.

I expect this is not possible right now, but I’m wondering about the potential feasibility of something like this. Is this fundamentally not possible with the IPFS protocol? One of the biggest downsides of IPFS versus torrents for me right now, is that with torrents I can access the data while still seeding it and not be forced to have a second copy. I’m looking for a bit more insight into what the challenges or blockers are for implementing something like this.

I had to read your post at least half a dozen times before i got a very rough idea of what you’re actually asking. Your subject definitely contradicts your question :slight_smile:

A couple things.

  1. You essentially want “nocopy” and let your host filesystem do deduplication. You need the --nocopy flag (ipfs add --nocopy --pin -r /path/to/data --progress). Consider the IPFS data folder to now just store references to blocks instead of the full file itself. It’s more complicated than that, but that’s the short version.
  2. You want to have an IPFS state and have the copy-on-write that btrfs/zfs provide. So you want bafy....1234 to refer to a state and bafy....4567 to refer to a different state of the same folder. Yeah, not possible. That would require deep integration of how a filesystem works (btrfs or zfs) into the logic of IPFS, and that is with the assumption that those filesystems even expose a way to access older states. At that point you might be writing a new filesystem.
  3. You seem to be wanting copy-on-write, and at that write point to have a new CID for that folder too. You can, kinda, have that with Kubo. It’s called “MFS” or Mutable File System, and you can find how it works in the cli commands (it’s all the ipfs files <command> ones, https://docs.ipfs.tech/reference/kubo/cli/#ipfs-diag-sys).

I could well be misinterpreting your questions, so do clarify them if the above is not applicable to what you asked. I did my best :wink:
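Point 3 above can be sketched roughly as follows. This is a minimal illustration, assuming a running Kubo node; the MFS path /dataset and the helper name snapshot_into_mfs are hypothetical and not from the thread. It links an already-added folder into MFS under a label and prints the CID of the MFS directory, which changes whenever the directory contents change.

```shell
# Hypothetical MFS directory that collects the different states of the dataset
MFS_DIR="/dataset"

# Hypothetical helper: link an already-added folder into MFS under a label
#   $1: root CID of the folder (e.g. the one printed by ipfs add)
#   $2: label for this state (e.g. v1)
snapshot_into_mfs() {
    # Create the MFS directory if it does not exist yet
    ipfs files mkdir -p "$MFS_DIR" &&
    # Reference the existing content by CID under the given label
    ipfs files cp "/ipfs/$1" "$MFS_DIR/$2" &&
    # Print the CID of the MFS directory after the change
    ipfs files stat --hash "$MFS_DIR"
}
```

A call such as snapshot_into_mfs "$NEW_ASSETS_CID" v1 would record the first state; repeating it with a new root CID and label v2 records the next one under the same MFS directory.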

Thank you. Apologies for the lack of clarity. I probably over-complicated it. Let’s forget about the whole copy-on-write thing. Point 1 is the main thing I care about. I only want 1 copy of the data on the system, but I want IPFS to be able to provide it to others (so long as I don’t delete / modify it).

Point 1 seems like a positive answer to my question, but I went to test it and it didn’t quite work the way I expected. What I wanted to test is this: if I add a random file with nocopy and then overwrite the file I pinned, that file should have been the only copy of the data, so trying to ipfs cat the data should fail. It doesn’t look like it does.

Here is what I did (on an ext4 filesystem):

ipfs config --json Experimental.FilestoreEnabled true

mkdir -p "$HOME"/tmp/test-ipfs-mfs
cd "$HOME"/tmp/test-ipfs-mfs

# Create random data
mkdir -p data
head -c1024 /dev/random | base32 > data/file1.txt

# Pin the data in no-copy mode
ipfs add --nocopy --pin -r ./data --progress | tee "new_pin_job.log"
NEW_ASSETS_CID=$(tail -n 1 new_pin_job.log | cut -d ' ' -f 2)
echo "NEW_ASSETS_CID = $NEW_ASSETS_CID"

# Test that the new CID is pinned
ipfs pin ls --type="recursive" --names | grep "$NEW_ASSETS_CID"

# Get the CID of file1
FILE1_CID=$(grep file1.txt new_pin_job.log | cut -d ' ' -f 2)
echo "FILE1_CID = $FILE1_CID"

# Clobber the data in data/file1.txt
echo "clobber" > data/file1.txt

# Try to list file1, should this be possible if the data it was referencing was overwritten?
ipfs cat "$FILE1_CID"

At the end of this script, ipfs did cat out the pre-clobbered contents of file1.txt. How is that possible if ipfs is not creating a copy? Is it possibly referring to an inode on disk via a hardlink, so even though I modified the file it is still able to access the blocks inside it?

That should not be possible… :worried:
No, it doesn’t work with hardlinks. I think it works with the file path plus an offset and size per block. Don’t quote me on it though.

You might need to restart ipfs after changing the datastore. Heck, i’m not even sure if this works on a running instance, so you might also need to init again.

You can check fairly easily whether it works (i know it does because i’m using this a lot). Note down the size of your ipfs data folder, then add a big file, say 100 MB or so, something that shows up in a new size check. Next, check the size of your data folder again. If it has grown by (at least) 100 MB then something isn’t working for you. If it doesn’t grow much (it will grow a little) then you’re fine.
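The size check described above could look something like this. It is only a sketch: it assumes the data folder is at $IPFS_PATH or, failing that, ~/.ipfs, and the helper names repo_size_kb and nocopy_growth_kb are made up for illustration.

```shell
# Size of the IPFS data folder in KiB (assumes IPFS_PATH or the default ~/.ipfs)
repo_size_kb() {
    du -sk "${IPFS_PATH:-$HOME/.ipfs}" | cut -f 1
}

# Hypothetical helper: add a file with --nocopy and print how many KiB
# the data folder grew by
#   $1: path of the file to add
nocopy_growth_kb() {
    before=$(repo_size_kb)
    # -Q prints only the final CID; it is discarded here
    ipfs add --nocopy --pin -Q "$1" > /dev/null || return 1
    after=$(repo_size_kb)
    echo $((after - before))
}
```

For example, after creating a roughly 100 MB test file with head -c 104857600 /dev/urandom > big.bin, running nocopy_growth_kb big.bin should print a number far below 102400 if the filestore is active, and something at or above that if the data was copied.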

That’s fantastic news :grin:, otherwise I would have been very confused. A simple restart of the daemon did the trick. I reran the script and got Error: unexpected EOF when I tried to cat the overwritten file.
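To confirm that the filestore is really in use after such a restart, something like the following might help. This is a sketch assuming a Kubo node with the experimental filestore; the helper name filestore_check is hypothetical, and the exact output of these commands is not shown in the thread.

```shell
# Hypothetical helper: inspect the state of the experimental filestore
filestore_check() {
    # Print the config value; it should read true once the feature is enabled
    ipfs config Experimental.FilestoreEnabled || return 1
    # List the blocks that are stored as references to files on disk
    ipfs filestore ls || return 1
    # Re-check every reference against the file it points to
    ipfs filestore verify
}
```

If ipfs filestore ls prints nothing after an ipfs add --nocopy, the data was most likely copied into the regular blockstore instead of being referenced.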

This solves 1 of my 2 major problems: I can now pin/share new data without making a copy. Given that this is possible, what about the case of downloading data on a new machine?

I know that I could just ipfs get $CID to download it to my filesystem, but what if I want to download it to my filesystem and pin/share it in a no-copy way? (i.e. effectively what bittorrent does).

This is a problem i too encountered and have found no solution for in just plain IPFS. Just doing ipfs get $CID “downloads” it alright but i think it still ends up being stored in both your data folder and where you typed your command. I could be wrong here though, i need to retest this to see why it didn’t work for me.

Regardless, what you want is essentially to mirror on machine Y what you have on machine X. What i did in the past to achieve that is a simple wget. My conditions were that X and Y both had their gateway exposed, so X could pull content from Y and the other way around. Once the data was downloaded, another ipfs add --nocopy ... gave me an effectively mirrored situation. This is a tedious mechanism because it requires both nodes to add with the same arguments, else they could potentially generate a different CID. It’s also slow because you effectively need to go over your data twice (download is one pass, add is the second), where it would be just once if it were done through ipfs.

Shameless plug, you can use cURL for this now too, it understands ipfs if you set a gateway!
IPFS_GATEWAY=http://... curl ipfs://<cid> works. If you have a normal IPFS installation, your gateway address should be stored in ~/.ipfs/gateway, which curl also looks for. You can also create that gateway file yourself and put the gateway address in it. Once it exists, plain curl ipfs://<cid> just works.
This might simplify commands for your case.
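The two-pass mirroring described above might be sketched like this for a single file. The GATEWAY default, the helper name mirror_file_nocopy, and the choice of curl with a plain gateway URL are assumptions for illustration; the thread only says that the other node's own gateway was used. The CID comparison at the end catches the case where the two nodes added the data with different arguments.

```shell
# Assumption: HTTP gateway of the node that already has the data
GATEWAY="${GATEWAY:-http://127.0.0.1:8080}"

# Hypothetical helper: download one file through the gateway, then add it
# with --nocopy and check that the resulting CID matches
#   $1: CID of the file, $2: local destination path
mirror_file_nocopy() {
    # Pass 1: download the file from the other node's gateway
    curl --fail --silent --show-error --location -o "$2" "$GATEWAY/ipfs/$1" || return 1
    # Pass 2: reference the downloaded file in the filestore without copying it
    new_cid=$(ipfs add --nocopy --pin -Q "$2") || return 1
    # A different CID means the add arguments differ from the original add
    if [ "$new_cid" != "$1" ]; then
        echo "CID mismatch: expected $1, got $new_cid" >&2
        return 1
    fi
    echo "$new_cid"
}
```

It would be called as GATEWAY=http://... mirror_file_nocopy <cid> /path/to/file, with the gateway address of node X when running on node Y.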

it alright but think it still ends up being stored in both your data folder and where you typed your command

Does it? I thought ipfs get just downloaded the data to a location without populating the data folder?

Either way it seems like adding a --nocopy flag to ipfs get or ipfs pin add would be a reasonable and very useful feature request and something that could be feasibly implemented given that the underlying mechanisms exist. Even if an initial script just did the 2 pass solution with 1 command that would be a useful way to improve the UX. Future work could fix the 2 pass issue further improving UX.
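Until such a flag exists, the two-pass version could be wrapped in one command along these lines. The function name ipfs_get_nocopy is hypothetical and is not an existing ipfs subcommand. The sketch assumes the filestore is enabled and that the original data was added with the same arguments, otherwise the CIDs will not match. It does not settle whether ipfs get also leaves blocks in the data folder, which is an open question in this thread.

```shell
# Hypothetical wrapper: download a CID to a directory, then pin it in
# no-copy mode from that directory
#   $1: root CID, $2: output directory
ipfs_get_nocopy() {
    # Pass 1: download the content to the local filesystem
    ipfs get -o "$2" "$1" || return 1
    # Pass 2: add the downloaded files as filestore references and pin them
    new_cid=$(ipfs add --nocopy --pin -r -Q "$2") || return 1
    # Warn if the re-added data does not reproduce the requested CID
    if [ "$new_cid" != "$1" ]; then
        echo "warning: re-added CID $new_cid differs from $1" >&2
        return 1
    fi
    echo "$new_cid"
}
```

Usage would be ipfs_get_nocopy <cid> /path/to/data, after which the files under /path/to/data must stay unmodified for the node to keep providing them.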

cURL for this now too, it understands ipfs if you set a gateway

I’m very happy curl has this feature, but I’ve found gateways to be extremely unreliable when data is only pinned by one or two nodes. For example, I can ipfs ls bafybeiedwp2zvmdyb2c2axrcl455xfbv2mgdbhgkc3dile4dftiimwth2y just fine, but if I try to view it in a gateway (on a machine not running an ipfs daemon) it almost always times out. I hope in the future curl gets a mechanism to directly use the ipfs protocol.

Ah you’re right, it misses the --nocopy. Forgot that.

You’re missing the point.
I’m not proposing that you use public gateways. I’m proposing that you use your own gateway, only as a means to sync node A and B.

But, as we just figured out, you could also use ipfs get first. Technically there should be no real difference. However, i have many, many times hit the issue that node A, for whatever magical reason, occasionally [1] has a difficult time finding node B (and vice versa). Now if your purpose is only to sync data back and forth, then you might as well ask the node directly, hence my suggestion to use that node’s specific gateway. Again, i mean your specific node A or B gateway, not an arbitrary public one.

[1] Yes, this is also with peering setup.