How can we efficiently remove unpinned objects from a large datastore?

We are currently exploring the process of removing unpinned objects from our IPFS datastore.

While garbage collection (GC) is a potential solution, we face the challenge of dealing with a substantial amount of data, totaling over 70 TB. This makes the GC process time-consuming and not ideal for our needs.

An alternative approach we’ve considered is using the ‘block rm’ method to remove specific blocks. However, due to the sheer volume of our pinned files, it becomes impractical to efficiently check if a particular block is indirectly pinned.

Furthermore, both methods seem to interfere with the pinning process. If we can’t efficiently remove unpinned objects, the pinning process will be delayed until the removal process is finished.

Is there a way to effectively remove unpinned objects while still allowing the pinning of new files or minimizing the time required for removal?

We know this is an issue and would like to fix it, but we haven’t had time to rewrite the GC from a full flush to a refcount-based design.

Either:

  • You commit a Go engineer for about a month (realistically anywhere from 1 week to 3 months, depending on unknown unknowns). With a bit of help from us, they should be able to implement a refcount GC. That means one expensive migration that rescans the data, but from that point on the GC is incremental.
  • You spin up a new cluster and gradually migrate the data over; once the data is migrated, you nuke the old node. This can be less efficient than running GC. However, as you noted, running the GC locks the pinning state and prevents adding more pins, whereas this migration doesn’t block the new nodes in the cluster.

There are probably other quick wins but I don’t think anything is gonna help you for 70TB. Refcounting sounds like the best (but biggest change).
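
To make the refcount idea concrete, here is a minimal in-memory sketch. It is not how the current GC works and it is not an existing IPFS API: the class `RefcountStore` and its methods are hypothetical names, blocks are plain strings instead of CIDs, and a real implementation would need to persist the counts and update them atomically. It also assumes every block was reachable from a pin when it was counted, which is what the one-off migration scan would establish.

```python
from collections import defaultdict


class RefcountStore:
    """Toy block store (hypothetical): each block counts its referrers."""

    def __init__(self):
        self.links = {}               # block id -> list of child block ids
        self.refs = defaultdict(int)  # block id -> referrers (pins + live parents)
        self.pins = set()             # pinned root ids

    def put(self, cid, children=()):
        """Store a block together with the blocks it links to."""
        self.links[cid] = list(children)

    def pin(self, root):
        """Pin a root; only blocks referenced for the first time are descended into."""
        if root in self.pins:
            return
        self.pins.add(root)
        stack = [root]
        while stack:
            cid = stack.pop()
            self.refs[cid] += 1
            if self.refs[cid] == 1:   # newly live: its children gain a referrer
                stack.extend(self.links[cid])

    def unpin(self, root):
        """Unpin a root and delete every block whose count drops to zero."""
        if root not in self.pins:
            return []
        self.pins.remove(root)
        removed = []
        stack = [root]
        while stack:
            cid = stack.pop()
            self.refs[cid] -= 1
            if self.refs[cid] == 0:   # no referrers left: drop it and release children
                del self.refs[cid]
                stack.extend(self.links.pop(cid))
                removed.append(cid)
        return removed


# Two files sharing one chunk.
store = RefcountStore()
for leaf in ("chunk1", "chunk2", "shared"):
    store.put(leaf)
store.put("fileA", ["chunk1", "shared"])
store.put("fileB", ["chunk2", "shared"])
store.pin("fileA")
store.pin("fileB")

removed = store.unpin("fileA")  # "shared" survives because fileB still references it
```

The point of the design is that unpinning only touches the blocks under the unpinned root, so the cost scales with the data being removed and not with the whole 70TB datastore, and no global lock on the pin set is needed while it runs.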
