Need help with an approach to archiving data from an IPFS node

chaitanyaprem · June 14, 2022, 11:29am

We have a go-ipfs node where new data (file objects as well as IPLD data) gets pinned every few minutes. The data pinned to the node grows at a rate of 15GB per day. We would like to continuously archive data older than 30 days and upload it to Filecoin.
The problem is with GC: whenever it runs, it blocks all IPFS/IPLD add/dag-put operations. I have seen that there are open issues about this, e.g. [META] Garbage Collection Enhancement/Rework · Issue #7752 · ipfs/kubo · GitHub. We have had to disable auto-GC for this reason.
It would be great if there were a workaround or an alternative way to achieve the same thing (i.e. back up to a CAR file and, mainly, remove archived entries from IPFS even while new data is being pinned).
Can I use ipfs-cluster to address this problem in some way (so that new pin transactions are not blocked during archival and cleanup of old data)?

Our IPFS node version is below; we allocate ~250GB of disk space for it.

go-ipfs version: 0.14.0-dev-5615715
Repo version: 12
System version: arm64/linux
Golang version: go1.18.1

Jorropo · June 14, 2022, 12:03pm

See ipfs dag export --help.
You might need to check out other tools to split the export into 32GiB archives.
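
As a rough illustration of the export step, here is a minimal Python sketch that wraps ipfs dag export and writes one CAR file per root CID. The helper names, the output directory, and the placeholder CID are hypothetical; the sketch assumes the ipfs binary is on PATH and it does not attempt the 32GiB splitting mentioned above.

```python
import subprocess
from pathlib import Path
from typing import List


def export_command(cid: str) -> List[str]:
    """Build the argv that streams the DAG under `cid` as a CAR on stdout."""
    return ["ipfs", "dag", "export", cid]


def export_to_car(cid: str, out_dir: Path, dry_run: bool = True) -> Path:
    """Write the CAR for `cid` to out_dir/<cid>.car.

    With dry_run=True (the default) the command is only printed,
    so the sketch can be tried without a running node.
    """
    car_path = out_dir / f"{cid}.car"
    cmd = export_command(cid)
    if dry_run:
        print(" ".join(cmd), ">", car_path)
        return car_path
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(car_path, "wb") as fh:
        # check=True raises if the export fails, so a failed export is
        # not silently treated as a finished archive.
        subprocess.run(cmd, stdout=fh, check=True)
    return car_path


# "bafy-example-root" is a placeholder, not a real CID.
export_to_car("bafy-example-root", Path("cars"))
```

Keeping the export as a separate, checked step makes it easier to unpin only the CIDs whose CAR file was actually written.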

chaitanyaprem · June 14, 2022, 12:45pm

Thanks @Jorropo, but this command will only export data on the node into a CAR file.
We would still have to unpin all the exported CIDs and run a manual GC on the IPFS node, right?
That would then block all running operations, i.e. any new add/dag-put calls.

I would like to know if there is any way to handle that.

Jorropo · June 14, 2022, 12:58pm

Just unpin them and do ipfs repo gc once the data is safe on Filecoin; I don’t really see any problem with that.
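
A minimal sketch of that sequence in Python, assuming the caller already has a list of CIDs confirmed as stored on Filecoin (how that is confirmed is outside this sketch). The function name and batch size are hypothetical; the commands themselves are ipfs pin rm and ipfs repo gc. Note that this still runs a regular GC, so it does not by itself avoid any blocking while GC runs.

```python
from typing import Iterable, List


def cleanup_commands(archived_cids: Iterable[str], batch_size: int = 500) -> List[List[str]]:
    """Return the commands to run, in order: unpin in batches, then one GC.

    Batching keeps each `ipfs pin rm` command line to a bounded length.
    Returns an empty list when there is nothing to unpin, so GC is not
    triggered needlessly.
    """
    cids = list(archived_cids)
    if not cids:
        return []
    commands = [
        ["ipfs", "pin", "rm", *cids[i:i + batch_size]]
        for i in range(0, len(cids), batch_size)
    ]
    commands.append(["ipfs", "repo", "gc"])
    return commands


# Placeholder CIDs for illustration only.
for cmd in cleanup_commands(["cid-a", "cid-b", "cid-c"], batch_size=2):
    print(" ".join(cmd))
```

Each returned command can be passed to subprocess.run(cmd, check=True) once the archive has been verified.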

Jorropo · June 14, 2022, 1:00pm

@chaitanyaprem are you using your IPFS node for transient storage?

Because if so: when we write those kinds of things, we rarely use an IPFS node. Instead we use the underlying libs directly and write them in a streaming / pipelined way.

chaitanyaprem · June 14, 2022, 4:11pm

Yes, that is what we were planning to do (unpin on IPFS and run repo gc once the data is safe on Filecoin).
But the problem occurs when we run repo gc.
First of all, it takes a lot of time if there are many objects to be GC’d. For example, when there were close to 5 million objects, GC ran for 30 minutes, after which we had to kill it, and none of the unpinned objects were cleaned up.
Secondly, while GC is running, any new pin operations (ipfs add / ipfs dag put) done via the RPC API get blocked and time out.

We are using the IPFS node as a HOT_STORAGE layer to serve the most recent data, and plan to migrate data older than 30 days to Filecoin for archival.

Hence, I wanted to see if there is any way for RPC operations (add/dag-put) towards IPFS to continue even during GC, for example by using ipfs-cluster in some form.
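
For reference, the 30-day hot-storage window described above could be selected roughly like this. This is only a sketch: it assumes the application keeps its own record of when each root CID was pinned (the pinned_at mapping here is hypothetical), and the CIDs are placeholders.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def select_for_archive(
    pinned_at: Dict[str, datetime],
    now: datetime,
    max_age_days: int = 30,
) -> List[str]:
    """Return the CIDs pinned more than `max_age_days` before `now`, sorted."""
    cutoff = now - timedelta(days=max_age_days)
    return sorted(cid for cid, ts in pinned_at.items() if ts < cutoff)


now = datetime(2022, 6, 14, tzinfo=timezone.utc)
records = {
    "cid-old": datetime(2022, 5, 1, tzinfo=timezone.utc),
    "cid-recent": datetime(2022, 6, 10, tzinfo=timezone.utc),
}
print(select_for_archive(records, now))  # ['cid-old']
```

The resulting list is what would be exported to CAR files, uploaded, and only then unpinned.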
