I don’t want to file a bug just yet, but there’s definitely something hinky with rc2 when compared to rc1, which possibly has to do with the changes to the connection code.
The biggest symptom is that reprovides now take vastly longer (in fact, they never complete), and the number of connections stays high: 4000+, which is normal during a reprovide, compared to around 200 when not reproviding.
My guess is that it’s failing to close certain kinds of connections, eventually reaches its maximum, and stalls, pretty much forever. Another symptom is that if I then try to stop the daemon (with ctrl-C), it doesn’t exit until I hit ctrl-C again (and I’ve waited a long time).
I’ve only gone through a couple of cycles of this so far, but the behavior was the same in both cases. I’ll continue investigating until I gain a better understanding of what’s actually going on, but I wanted to give a preliminary report, so that others can look into it as well.
Just for info, I’m not using the accelerated DHT, but I’m using the optimistic provide. Exact configuration will be provided when I file a bug.
Thank you for the early flag, @ylempereur.
This release updated go-libp2p and go-libp2p-kad-dht, so it would be good to understand where the difference comes from. It could be a regression, or it could be an effect of connectivity fixes lifting artificial limits that existed before.
On the high number of connections: I’ve seen it on a publicly dialable node that is also a DHT server, so a few follow-up questions:
Are you sure the reprovide system is slower, or is that a guess? (Is ipfs stats provide showing a lower AvgProvideDuration for a similar TotalProvides compared to the older version?)
Do you have a custom Swarm.ConnMgr in your ipfs config, or is it empty (running defaults)?
Small ask: are you able to check whether you experience the same high number of connections if you set Routing.Type to autoclient in your ipfs config and restart the node?
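For anyone wanting to try the same check, this is the sequence I mean (standard `ipfs config` commands; `Routing.Type` defaults to `auto` when unset):

```shell
# Show the current routing mode
ipfs config Routing.Type

# Switch to DHT client-only mode (node stops acting as a DHT server)
ipfs config Routing.Type autoclient

# Restart the daemon for the change to take effect
ipfs daemon
```

Switching back is just `ipfs config Routing.Type auto` and another restart.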
fwiw I see a high number of connections only when running with auto. Switching to autoclient is night and day and keeps it at a few hundred, which makes me think this is related to the DHT server somehow.
Is the ctrl-C issue occurring every time, or only after running for a while and reaching >3k connections? (I was not able to reproduce it yet.)
Shot in the dark: in -rc2 we updated from go1.22 to go1.23. Do both problems go away if you run with the env variable that restores the Go 1.22 timers (GODEBUG=asynctimerchan=1 ipfs daemon)?
I’m unfortunately on vacation right now, so it’s hard to control my node that is running at home. I’ll try and find some time to experiment in the next few days.
A couple of things I can tell you right away, though:
The reprovide NEVER completes. My node has been running more than two days now, and the first reprovide is still running (it normally takes a bit over 2 hours). It still has over 4K connections, and “ipfs stats provide” still reports all zeros.
Swarm.ConnMgr is {}
I’ll try and test the other 3 things as soon as I get a chance.
No worries. I’ve run two nodes side by side overnight and confirmed that the go1.23 timers are what makes the difference. Running with GODEBUG=asynctimerchan=1 fixes the regression.