We see a constant 100% CPU load on our t3a.large EC2 instance, and the machine sometimes becomes unresponsive because of it. We use this node to store project files; the team generates little activity because this is our development environment. The repo is also not huge, less than 4 GB.
What I already tried:
- upgraded from 0.19.0 to 0.23.0
- applied lowpower profile
This did not help at all. Is there any way to narrow down what subsystem causes that load? I enabled debug logs and collected CPU profiling data but those didn’t give me any clue about what is going on.
Can you run `ipfs diag profile` and post it here?
This will capture a profile of where CPU time is spent.
Thank you for your prompt reply. I didn’t find a way to upload a zip, so here is the Dropbox link to the archive: ipfs-profile-2023-11-06T12_39_12Z_00.zip
Thanks, this is clearly a bug somewhere.
We seem to be doing some weird timer things.
I don’t know what off the top of my head.
This could be a `time.Ticker` that is leaked.
I pushed a fix here: swarm: fix timer Leak in the dial loop (Pull Request #2636, libp2p/go-libp2p)
If go-libp2p makes a backport release, I’ll try to include this in Kubo.
Thank you very much for your help! I’ll keep watching for that on GitHub. Is there any workaround you can suggest? We are fine with limiting swarm functionality temporarily (if that is possible) until the fix makes its way into a release.
I don’t know of a workaround.
Thanks to go-libp2p releasing v0.32.1, this should be fixed in Kubo v0.24.0, which should be released later today.
It is better now, thank you for the quick fix. Utilization no longer stays at 100% all the time; however, there are periodic, prolonged spikes to 80-95% that I can hardly explain, given how cyclical they look.
That is still an improvement. However, it does not surprise me that you have another issue, because the bug in go-libp2p was there for a few releases and yet you are the first one to report it (it would not leak in usual operations).
If you could capture a new profile that covers a spike, that would be nice. You can increase the duration with `ipfs diag profile --profile-time 1m` if you need something longer to catch the spike.
@Jorropo Is that an issue also caused by the timers thing? I am wondering if maybe we can downgrade to one of the previous stable versions where this bug does not exist. This issue is really major for us as it has a high impact on our system performance now.
I’m looking into it. After discussing with some libp2p maintainers, the only possible way for this bug to trigger the way it did is for you to make lots of dial attempts to nodes that have zero addresses. Do you know if anything in your workflow could make you dial unreachable peers, or peers we can’t find addresses for?
It is hard for me to answer this, as it is too low-level for me. We do nothing beyond the basics: interacting with MFS and publishing IPNS names. If you have any specific requests, I am happy to get this information for you. Should I look at the bootstrap addresses or anywhere else?