`ipfs pin add` often hangs with a wantlist of ~7 CIDs

I’m running a private swarm at work to push disk images to lab equipment, and I’ve found that ipfs pin add often hangs with seemingly just a few items in the want list shown by ipfs stats bitswap. Typically if I run a second instance of ipfs pin add it will complete within a few seconds and both ipfs pin add processes then return.

I’ve taken to looping over all my nodes to launch second, and sometimes third, ipfs pin add processes to get things moving along faster. That seems wrong, but I’m not sure which diagnostics to collect to make a constructive bug report.

My environment

Private swarm on a 10/8 corporate network. Latency is typically < 10ms. Throughput is typically ~800 MiB/s (i.e. 10Gbps), with the notable exception of my laptop which only has 3 MiB/s and acts as a seed for the update.

All the nodes are running kubo.linux-arm64 v0.31.1 on Raspberry Pis with kernel 6.6.20+rpt-rpi-v8.

Steps to reproduce

  1. Laptop: CID=$(ipfs add -Q ./update.img.zst)
  2. Laptop: ipfs-cluster-ctl pin add $CID --name update.img.zst
  3. Wait for the two bootstrap nodes to report “PINNED” in ipfs-cluster-ctl status
  4. Laptop: Launch an Ansible playbook to start IPFS nodes on a number of Raspberry Pi 4Bs

Each Raspberry Pi 4B will then:

  1. ipfs daemon --enable-gc
  2. ipfs pin add $CID
  3. ipfs get $CID -o /some/path/update.img.zst
  4. ipfs pin rm $CID

Typically step (2) hangs on about 1/5 of the IPFS nodes, so I loop over them and run ssh -t $SOMEHOST ipfs pin add $CID, which usually resolves it within 5-10s on each node.

How should I go about troubleshooting this? For example, which ipfs log level commands would be most appropriate?

Sample stuck nodes and loop to unstick them
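
A minimal sketch of such a loop, assuming passwordless ssh to each node. The function name repin_hosts and the host names in the usage line are hypothetical placeholders, not values from the actual setup:

```shell
#!/bin/sh
# Sketch: re-run `ipfs pin add` on each stuck host.
# repin_hosts and the host names below are hypothetical placeholders.
repin_hosts() {
    cid=$1
    shift
    for host in "$@"; do
        echo "re-pinning $cid on $host"
        # Same command as described above; report failures but keep going.
        ssh -t "$host" ipfs pin add "$cid" ||
            echo "pin add failed on $host" >&2
    done
}

# Usage (placeholder host names):
#   repin_hosts "$CID" pi-01 pi-02 pi-03
```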

What are the chances that GC is running while trying to add? GC will lock the pinning system.
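
One way to judge whether GC is a plausible culprit is to look at the GC-related settings and the current repo size on a hanging node. The sketch below assumes the standard Datastore config keys, and that automatic GC (enabled by --enable-gc) is governed by GCPeriod and by the StorageGCWatermark percentage of StorageMax. The helper name gc_settings is made up for illustration:

```shell
#!/bin/sh
# Sketch: print the settings that govern automatic GC, then the repo size.
# gc_settings is a hypothetical helper name.
gc_settings() {
    for key in Datastore.GCPeriod Datastore.StorageMax Datastore.StorageGCWatermark; do
        printf '%s=%s\n' "$key" "$(ipfs config "$key")"
    done
    # --size-only skips counting objects, which can be slow on a large repo.
    ipfs repo stat --size-only --human
}

# Usage, on a hanging node:
#   gc_settings
```

If the repo size is close to the watermark, temporarily starting the daemon without --enable-gc should show whether the hangs go away.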

Do you mean that 1 in 5 nodes hangs, and that when you then run ipfs pin add on some other node, the node that was hanging finishes pinning?

Does the node hang indefinitely otherwise? If not, for how long does it hang?

Does the hanging node report being connected to the other nodes that have the content (ipfs swarm peers)?

Does the hanging node have connectivity to all peers that have the content?

On the hanging node, what does ipfs routing findprovs "cid" say? If providers are found, is the node connected to those peer IDs?
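
The last two checks can be combined into one small script. This is only a sketch: it assumes ipfs routing findprovs prints one peer ID per line and that ipfs swarm peers prints multiaddrs ending in /p2p/<peer ID>. The function name check_providers is hypothetical:

```shell
#!/bin/sh
# Sketch: for each provider of a CID, report whether this node is
# currently connected to it. check_providers is a hypothetical helper name.
check_providers() {
    cid=$1
    # Keep only the peer ID that follows the last /p2p/ in each address.
    peers=$(ipfs swarm peers | sed 's|.*/p2p/||')
    ipfs routing findprovs "$cid" | while read -r prov; do
        [ -n "$prov" ] || continue
        if printf '%s\n' "$peers" | grep -qxF "$prov"; then
            echo "$prov connected"
        else
            echo "$prov NOT connected"
        fi
    done
}

# Usage, on the hanging node:
#   check_providers "$CID"
```

No output at all would mean no providers were found, which points at routing rather than at bitswap.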

I cannot rule out that this is a bitswap bug, and there are some improvements coming in the next release, but first we should rule out other causes.
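
If those checks come back clean, debug logs from the bitswap subsystems would be useful to attach to a bug report. The sketch below assumes that subsystem names containing "bitswap" show up in ipfs log ls. The exact names vary between versions, so they are discovered rather than hard-coded. The helper name set_bitswap_loglevel is hypothetical:

```shell
#!/bin/sh
# Sketch: set the log level of every logging subsystem whose name
# contains "bitswap". set_bitswap_loglevel is a hypothetical helper name.
set_bitswap_loglevel() {
    level=${1:-debug}
    ipfs log ls | grep -i bitswap | while read -r subsystem; do
        # </dev/null keeps the inner command from consuming the loop's input.
        ipfs log level "$subsystem" "$level" </dev/null
    done
}

# Usage, on a hanging node while the daemon is running:
#   set_bitswap_loglevel debug
```

The extra messages appear in the daemon's own log output. Calling the function again with a quieter level such as error turns them back down.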