Constant 100% CPU utilization on AWS EC2 IPFS node

zxriptor · November 6, 2023, 12:34pm

We see a constant 100% CPU load on our t3a.large EC2 instance, and the machine sometimes becomes unresponsive because of it. We use this node to store project files; the team generates little activity because this is our development environment. The repo is also not huge, less than 4 GB.
What I already tried:

upgraded from 0.19.0 to 0.23.0
applied lowpower profile

This did not help at all. Is there any way to narrow down what subsystem causes that load? I enabled debug logs and collected CPU profiling data but those didn’t give me any clue about what is going on.

Jorropo · November 6, 2023, 12:38pm

Can you run ipfs diag profile and post the result here?
This will capture a profile of where CPU time is spent.

zxriptor · November 6, 2023, 12:52pm

Thank you for your prompt reply. I didn’t find a way to upload a zip, so here is a Dropbox link to the archive: ipfs-profile-2023-11-06T12_39_12Z_00.zip

Jorropo · November 6, 2023, 1:00pm

Thx, this is clearly a bug somewhere.
We seem to be doing some weird timer things.

I don’t know the cause off the top of my head.

Jorropo · November 6, 2023, 1:03pm

This could be a time.Ticker that is leaked.

Jorropo · November 6, 2023, 1:12pm

I pushed a fix here: "swarm: fix timer Leak in the dial loop" (libp2p/go-libp2p, Pull Request #2636)
If go-libp2p makes a v0.32.1 backport release I’ll try to include this in Kubo v0.24.0.

zxriptor · November 6, 2023, 1:24pm

Thank you very much for your help! I’ll keep watching for that on GitHub. Is there any workaround you can suggest? We are fine with limiting swarm functionality temporarily (if that is possible) until the fix makes its way into a release.

Jorropo · November 6, 2023, 1:45pm

I don’t know of a workaround.

Jorropo · November 8, 2023, 9:42am

Thanks to go-libp2p making a v0.32.1 release, this should be fixed in Kubo v0.24.0, which should be released later today.

zxriptor · November 10, 2023, 10:17am

It is better now, thank you for the quick fix. Utilization no longer stays at 100% all the time; however, there are periodic prolonged spikes to 80-95% that I can hardly explain, given how cyclical they look.

Jorropo · November 10, 2023, 11:04am

That is still an improvement. However, it does not surprise me that you have another issue, because the bug in go-libp2p was there for a few releases and yet you are the first one to report it (it would not leak in usual operations).
If you could capture a new profile that catches the spike, that would be nice. You can increase the duration with ipfs diag profile --profile-time 1m if you need something longer to catch the spike.

zxriptor · November 10, 2023, 11:14am

Here you go (Dropbox link): ipfs-profile-2023-11-10T11_09_49Z_00.zip

zxriptor · November 15, 2023, 5:40pm

@Jorropo Is this issue also caused by the timer problem? I am wondering whether we could downgrade to a previous stable version where this bug does not exist. This is a major issue for us, as it currently has a high impact on our system performance.

Jorropo · November 18, 2023, 8:10am

I’m looking into it. After discussing with some libp2p maintainers, the only way this bug could trigger the way it did is if you make lots of dial attempts to nodes that have zero addresses. Do you know if anything in your workflow could make you dial unreachable peers? Or peers we can’t find addresses for?

zxriptor · November 22, 2023, 7:00pm

This is hard for me to answer, as it is too low-level for me. We do nothing beyond the basics: interacting with MFS and publishing IPNS names. If you have any specific requests, I am happy to get that information for you. Should I look at the bootstrap addresses or anywhere else?
