IPFS Cluster node using up all system resources after restart

We have a 3-node ipfs-cluster setup (each node has an 8-core CPU and 16 GB RAM), and we recently had to restart one of the nodes for maintenance. The node was down for about 15 minutes, and since restarting it has been using all available CPU and running overloaded for more than a day. When I enabled logs in the cluster service, I constantly saw the messages below, along with logs indicating `pin ls` calls, which I am guessing are done to sync up the pinset state.

2022-10-20T13:41:26.433Z	INFO	pintracker	stateless/stateless.go:633	Restarting pin operation for bafyreifepl3y6hezsav6vwinx6qxgdpmz4lszgnnmntpm7jjsttxuhqurq
2022-10-20T13:44:20.252Z	INFO	crdt	crdt/consensus.go:244	new pin added: bafkreibgp352z54gmjeltpecijgedoxpfdwgez2du5tz3eeu3nm6y7xuyi
2022-10-20T13:44:20.417Z	INFO	ipfshttp	ipfshttp/ipfshttp.go:593	IPFS Unpin request succeeded: bafyreievlo7c64nbdl7jw3jdebzi4h5mno664z3evedijw4tcpokay5s3q
2022-10-20T13:44:20.434Z	INFO	crdt	crdt/consensus.go:244	new pin added: bafkreicrsnpeucm3hp425qetjb7suunay7lovtoyalvkdymypk3imxoxnu
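In case it helps, this is how I gauge how much recovery work the pintracker is doing: counting the "Restarting pin operation" messages. The sample lines below are inlined just for illustration; in practice I pipe the cluster service log itself through the same grep.

```shell
# Count how many pin operations the stateless pintracker is restarting.
# Sample log lines inlined here; normally this grep runs over the
# actual cluster service log (the path varies by deployment).
cat <<'EOF' | grep -c 'Restarting pin operation'
2022-10-20T13:41:26.433Z INFO pintracker stateless/stateless.go:633 Restarting pin operation for bafyreifepl3y6hezsav6vwinx6qxgdpmz4lszgnnmntpm7jjsttxuhqurq
2022-10-20T13:44:20.252Z INFO crdt crdt/consensus.go:244 new pin added: bafkreibgp352z54gmjeltpecijgedoxpfdwgez2du5tz3eeu3nm6y7xuyi
EOF
```

With millions of pins, the count grows quickly after a restart.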

The cluster is running with a replicationFactor of 1 and has millions of pins; we are mainly using it to get around the GC issue.

Note that the restarted node is also handling new traffic coming in over RPC, both to query data and to pin new data.

Below is a screenshot from Sensu, which has been showing the node as overloaded since it was restarted.

  • What could be causing this node to not finish its sync?
  • In general, should we wait for a node to sync its state before opening it to traffic when it rejoins the cluster?
  • Are there any metrics I could look at to find the bottleneck?

Can you confirm which process is using the CPU? Cluster or kubo?
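Something like this should show it (assuming Linux with procps; the process names `ipfs` for kubo and `ipfs-cluster-service` for cluster are the defaults, yours may differ):

```shell
# List the top CPU consumers; look for "ipfs" (kubo) and
# "ipfs-cluster-service" in the COMMAND column.
ps -eo pid,comm,%cpu --sort=-%cpu | head -n 10
```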

Which cluster version are you using? Every few minutes there should also be a message like:

INFO crdt go-ds-crdt@v0.3.9/crdt.go:562 Number of heads: 1. Current max height: 39. Queued jobs: 0. Dirty: false.

What is this message like in your case?
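If it is easier, you can pull the relevant fields straight out of that log line, e.g. (using the example line above; substitute a line from your own log):

```shell
# Extract the sync-progress fields from a go-ds-crdt status line.
line='INFO crdt go-ds-crdt@v0.3.9/crdt.go:562 Number of heads: 1. Current max height: 39. Queued jobs: 0. Dirty: false.'
echo "$line" | grep -o 'Number of heads: [0-9]*'   # how many DAG heads remain to merge
echo "$line" | grep -o 'Queued jobs: [0-9]*'       # pending processing jobs
echo "$line" | grep -o 'Dirty: [a-z]*'             # whether the state is still catching up
```

A single head, zero queued jobs, and `Dirty: false` indicate the CRDT state is fully synced.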

I can see both ipfs and ipfs-cluster using CPU; they keep taking turns, constantly switching between low and high CPU usage.

Versions being used:
ipfs/ipfs-cluster:latest (we picked this up after you merged this fix into master)
ipfs version 0.15.0

Also, I can constantly see the log below in the ipfs service:

2022-10-20T15:18:32.298Z	ERROR	core/commands/cmdenv	pin/pin.go:133	context canceled
2022-10-20T15:25:09.352Z	ERROR	core/commands/cmdenv	pin/pin.go:133	context canceled
2022-10-20T15:28:16.979Z	ERROR	core/commands/cmdenv	pin/pin.go:133	context canceled

I saw the message at two different times, as below:

2022-10-20T13:41:00.062Z INFO crdt go-ds-crdt@v0.3.7/crdt.go:562 Number of heads: 2. Current max height: 42735. Queued jobs: 0. Dirty: false

2022-10-20T15:31:00.053Z INFO crdt go-ds-crdt@v0.3.7/crdt.go:562 Number of heads: 1. Current max height: 42779. Queued jobs: 0. Dirty: false

Looks like it has finally recovered, but it took almost 4-5 days.
Is there any tuning that can be done to handle such restarts better? Otherwise, the node cannot be used for days on end until it syncs again.

Did you find a solution?