We have a 3-node ipfs-cluster setup (each node has an 8-core CPU and 16 GB RAM), and we had to restart one of the nodes for maintenance. The node was down for about 15 minutes. Since the restart, it has been using all available CPU and has been overloaded for more than a day. When I enabled logging in the cluster service, I constantly saw the log lines below, along with logs indicating pin ls calls, which I am guessing are made to sync up the pinset state.
2022-10-20T13:41:26.433Z INFO pintracker stateless/stateless.go:633 Restarting pin operation for bafyreifepl3y6hezsav6vwinx6qxgdpmz4lszgnnmntpm7jjsttxuhqurq
2022-10-20T13:44:20.252Z INFO crdt crdt/consensus.go:244 new pin added: bafkreibgp352z54gmjeltpecijgedoxpfdwgez2du5tz3eeu3nm6y7xuyi
2022-10-20T13:44:20.417Z INFO ipfshttp ipfshttp/ipfshttp.go:593 IPFS Unpin request succeeded:bafyreievlo7c64nbdl7jw3jdebzi4h5mno664z3evedijw4tcpokay5s3q
2022-10-20T13:44:20.434Z INFO crdt crdt/consensus.go:244 new pin added: bafkreicrsnpeucm3hp425qetjb7suunay7lovtoyalvkdymypk3imxoxnu
The cluster runs with a replicationFactor of 1 and holds millions of pins, as we are only using it to get around the GC issue.
Note that the restarted node is also handling new traffic coming in over RPC, both to query data and to pin new data.
Below is a screenshot from Sensu, which has shown the node as overloaded ever since it was restarted.
- What could be preventing this node from finishing its sync?
- Should we, in general, wait for a node to sync its state before opening it to traffic when it is part of a cluster?
- Are there any metrics I could look at to find out what is causing the bottleneck?
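
In the meantime, one rough way to see which operation dominates is to tally the cluster log lines above by component and message type. The following is a minimal Python sketch, not part of ipfs-cluster: `parse_line`, `tally`, and `SAMPLE` are hypothetical names, and the field layout (timestamp, level, component, source location, message) is an assumption taken from the log excerpt in this post. It accepts both the space-separated and the pipe-separated form of the lines.

```python
import re
from collections import Counter

# CIDv1 in base32 ("baf...") or CIDv0 ("Qm...") tokens. They are stripped
# from each message so that lines differing only by CID fall into one group.
CID_RE = re.compile(r"\b(?:baf[a-z2-7]{20,}|Qm[1-9A-HJ-NP-Za-km-z]{44})\b")


def parse_line(line):
    """Return (component, message_kind), or None if the line does not match
    the assumed 5-field layout: timestamp, level, component, source, message."""
    line = line.strip()
    if not line:
        return None
    if line.startswith("|"):
        # Pipe-separated variant: |ts|LEVEL|component|file:line|message|
        parts = line.strip("|").split("|", 4)
    else:
        # Space-separated variant: ts LEVEL component file:line message...
        parts = line.split(None, 4)
    if len(parts) != 5:
        return None
    _ts, _level, component, _source, message = parts
    kind = CID_RE.sub("", message).strip().rstrip(":").strip()
    return component, kind


def tally(lines):
    """Count log lines per (component, message_kind)."""
    counts = Counter()
    for line in lines:
        parsed = parse_line(line)
        if parsed is not None:
            counts[parsed] += 1
    return counts


# The four log lines quoted in this post, in both formats.
SAMPLE = """\
2022-10-20T13:41:26.433Z INFO pintracker stateless/stateless.go:633 Restarting pin operation for bafyreifepl3y6hezsav6vwinx6qxgdpmz4lszgnnmntpm7jjsttxuhqurq
|2022-10-20T13:44:20.252Z|INFO|crdt|crdt/consensus.go:244|new pin added: bafkreibgp352z54gmjeltpecijgedoxpfdwgez2du5tz3eeu3nm6y7xuyi|
|2022-10-20T13:44:20.417Z|INFO|ipfshttp|ipfshttp/ipfshttp.go:593|IPFS Unpin request succeeded:bafyreievlo7c64nbdl7jw3jdebzi4h5mno664z3evedijw4tcpokay5s3q|
2022-10-20T13:44:20.434Z INFO crdt crdt/consensus.go:244 new pin added: bafkreicrsnpeucm3hp425qetjb7suunay7lovtoyalvkdymypk3imxoxnu
"""

if __name__ == "__main__":
    for (component, kind), n in tally(SAMPLE.splitlines()).most_common():
        print(f"{n:6d}  {component:12s} {kind}")
```

Run over a few minutes of real logs (for example by reading lines from a log file instead of `SAMPLE`), the counts should show whether the node is mostly restarting pin operations, applying new pins from the CRDT, or unpinning.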