How to efficiently fetch/pin incremental changes to a large dataset?

solipsis-project · November 16, 2024, 10:28pm

The context:

I have a private swarm that is incrementally building a large dataset. Each node is responsible for building part of the dataset in its MFS, and it periodically sends its root hash to a command server. The command server uses these hashes to populate its own MFS. Thus, the root hash of the command server’s MFS contains all of the currently collected data.

The objective:

I would like the command server’s node to periodically fetch all of the data in the private swarm, so that I have all the data in one place (the command server is going to be longer lived than the other nodes). Ideally I would also like to create “snapshots”, pins that capture the dataset on a specific date.

In theory this is simple enough: all I need to do is pin the MFS root hash. I can even use the --name flag in order to name the pin with the current date. This will force the node to pull the needed blocks from the other nodes in the network.
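As a minimal sketch of that straightforward approach (not necessarily how the command server does it): read the MFS root hash with `ipfs files stat --hash /`, then build an `ipfs pin add` call named after the date. The helper names, the `snapshot-` name prefix, and the injectable `run` parameter are assumptions made for illustration.

```python
import datetime
import subprocess


def mfs_root_cid(run=subprocess.run):
    """Return the current MFS root hash via `ipfs files stat --hash /`."""
    result = run(
        ["ipfs", "files", "stat", "--hash", "/"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


def snapshot_pin_command(cid, day):
    """Build the argv for a full, named pin of `cid`.

    The "snapshot-YYYY-MM-DD" naming scheme is a hypothetical choice.
    """
    name = f"snapshot-{day.isoformat()}"
    return ["ipfs", "pin", "add", "--progress", f"--name={name}", cid]


if __name__ == "__main__":
    # Requires a running Kubo node; walks the whole DAG, as described above.
    cid = mfs_root_cid()
    subprocess.run(snapshot_pin_command(cid, datetime.date.today()), check=True)
```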

However, ipfs pin add will walk the entire dag every time, even if all of the blocks are already local. I want the time required to scale with the number of missing blocks, not the total size of the data.

Fortunately, we also have ipfs pin update --unpin=false, which is designed for just this use case: it takes an old hash and a new hash and diffs the dags to identify only the parts that have changed.
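A rough sketch of the incremental step, assuming the previous snapshot's root hash is already known: it only wraps the `ipfs pin update --unpin=false` invocation mentioned above. The function names and the `run` parameter are hypothetical.

```python
import subprocess


def pin_update_command(old_cid, new_cid):
    """Build the argv that pins `new_cid` by diffing against `old_cid`.

    --unpin=false keeps the old pin, so earlier snapshots stay pinned.
    """
    return ["ipfs", "pin", "update", "--unpin=false", old_cid, new_cid]


def advance_snapshot(old_cid, new_cid, run=subprocess.run):
    """Pin `new_cid` incrementally; return False if nothing changed."""
    if old_cid == new_cid:
        return False  # root hash unchanged, nothing new to fetch
    run(pin_update_command(old_cid, new_cid), check=True)
    return True
```

Passing `run` as a parameter is only there so the command can be inspected or logged without touching a live node.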

Unfortunately, ipfs pin update appears to have some limitations:

ipfs pin update does not support the --progress flag, making it impossible to measure progress and estimate how long the operation will take.
The new pin will have the same name as the old pin. I can rename the new pin by running ipfs pin add with the new name, but this results in walking the entire dag again, defeating the purpose of using update in the first place.
Most concerning: if I use ipfs pin add to rename an existing pin and the process is terminated partway through, the hash ends up unpinned!

It seems like my best option is to use ipfs pin update, leave the pins unnamed within kubo, and just track the names externally.
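One way the external tracking could look, as a sketch only: a small JSON file mapping a snapshot name to its root hash, rewritten atomically so an interrupted write cannot corrupt it. The file layout, function names, and ISO-date naming are all assumptions, not anything kubo provides.

```python
import json
from pathlib import Path


def load_index(path):
    """Load the name -> root hash mapping, or an empty dict if absent."""
    p = Path(path)
    if not p.exists():
        return {}
    return json.loads(p.read_text())


def record_snapshot(path, name, cid):
    """Add or overwrite one entry, writing via a temp file plus rename."""
    index = load_index(path)
    index[name] = cid
    tmp = Path(str(path) + ".tmp")
    tmp.write_text(json.dumps(index, indent=2, sort_keys=True))
    tmp.replace(path)  # rename within the same directory
    return index


def latest_snapshot(path):
    """Return (name, cid) of the newest entry, or None if the index is empty.

    Assumes names are ISO dates (YYYY-MM-DD), which sort lexicographically.
    """
    index = load_index(path)
    if not index:
        return None
    name = max(index)
    return name, index[name]
```

The hash returned by `latest_snapshot` would then serve as the old hash for the next ipfs pin update run.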

But I want to make sure that there’s not some other solution that I’m missing:

I haven’t been using Clusters because those seem to be for distributing many pins across nodes, which is a slightly different use case.
Is there a better way to efficiently fetch only the new blocks from the other nodes?
Is there another way to rename pins without running ipfs pin add?
