We’re testing Kubo upgrades in our dev environment. When upgrading to anything above v0.15.0 we’ve noticed that gateway requests for IPNS names now take anywhere from 3-4m to resolve. Rolling back to v0.15.0 resolves the issue.
According to the release notes for v0.16.0 and above, Kubo now ignores IPNS V1 signatures and only accepts records with valid V2 signatures:
Records that do not have a valid IPNS V2 signature, or exceed the max size
limit, will no longer pass verification, and will be ignored by Kubo when
resolving /ipns/{libp2p-key} content paths.
Could this be related to the slow performance we’re seeing with newer versions of Kubo?
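One thing we considered checking is whether the records themselves fail V2 verification. A rough sketch of how that could be done, assuming a Kubo build new enough to ship ipfs name inspect (the exact flags and record handling below are our guess, not something we've confirmed):
# Fetch the raw IPNS record, then ask Kubo to verify it against the name.
# (Older versions expose the first command as "ipfs dht get" instead.)
$ ipfs routing get /ipns/k51qzi5uqu5dmjnxaa3ed22c0a6xaoyw5d75uacizdpc4u9rmxnqialvwat8y9 > record.pb
$ ipfs name inspect --verify k51qzi5uqu5dmjnxaa3ed22c0a6xaoyw5d75uacizdpc4u9rmxnqialvwat8y9 < record.pb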
The slow behavior can be reproduced by running Kubo > 0.15.0 with the following:
$ ipfs config --json DNS.Resolvers '{"eth.": "https://dns.eth.limo/dns-query"}'
# Slow IPNS example: /ipns/k51qzi5uqu5dmjnxaa3ed22c0a6xaoyw5d75uacizdpc4u9rmxnqialvwat8y9/
$ time curl http://localhost:8080/ipns/simple.eth/ -s -o /dev/null
real 3m0.025s
user 0m0.021s
sys 0m0.010s
# Working IPFS example: /ipfs/bafybeighxhsavoanjqkqvnnpbkvoweurybjt7gauunbg37ueahcbze5ise/
$ time curl http://localhost:8080/ipns/vitalik.eth/ -s -o /dev/null
real 0m0.223s
user 0m0.007s
sys 0m0.006s
The node on which these tests were performed is publicly accessible and works fine when downgrading to Kubo 0.15.0.
Has anyone else noticed this or have experience troubleshooting similar issues? Are there any debugging utilities or logs that might shed some light on what’s responsible for the slowness? A large percentage of ENS dWebsites use IPNS now, so obviously we don’t want to deploy an upgrade that will negatively impact user experience.
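In case it's useful context for anyone reproducing this, we've also been raising Kubo's per-subsystem log levels while testing; the subsystem names below are the ones we picked out of ipfs log ls, so treat them as a guess:
# See which logging subsystems exist, then turn up the likely suspects.
$ ipfs log ls
$ ipfs log level namesys debug
$ ipfs log level dht debug
# Or set levels via the environment before starting the daemon:
$ GOLOG_LOG_LEVEL="namesys=debug,dht=debug" ipfs daemon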
Wow, great post. I wish I had a direct answer for you. I do have a question, though: are you using IPNS over PubSub? And if not, does enabling it improve your situation at all?
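For reference, this is roughly how I run it on my node; both daemon flags are needed (double-check them against your Kubo version):
# Enable PubSub plus IPNS-over-PubSub on the daemon.
$ ipfs daemon --enable-pubsub-experiment --enable-namesys-pubsub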
Wow, that’s surprising to me; I found resolution times after the first lookup to be much faster with that enabled.
My next idea (if you have the resources for it; IIRC this increases your memory requirements) is to see how the Accelerated DHT Client fares for you, as I found it to be critical for my VPS node.
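If you want to give it a shot, I believe the toggle for the Kubo versions discussed here still lives under Experimental (it may have moved in newer releases), roughly:
# Enable the Accelerated DHT Client, then restart the daemon.
$ ipfs config --json Experimental.AcceleratedDHTClient true
$ ipfs shutdown && ipfs daemon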
Here is a bug I reported back in September that may (or may not) be related. The fact that it appeared in the same version and involves discovery when using the Accelerated DHT is what makes me think they could be connected.
This should not be impacted by the signature change; it’s the same bytes under the hood, we just use different fields from the protobuf.
I don’t know without investigating, and I don’t really have time for this right now, sorry. One thing I would check is whether the ipfs dht get command also sees a slowdown between 0.15 and newer versions (the point is to see whether the slowdown is in the DHT or IPNS codepaths).
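Something like this on both versions would do, just to time the raw record lookup separately from the gateway path (using the slow key from your repro):
# Time only the DHT record fetch, bypassing the IPNS/gateway codepaths.
$ time ipfs dht get -- /ipns/k51qzi5uqu5dmjnxaa3ed22c0a6xaoyw5d75uacizdpc4u9rmxnqialvwat8y9 > /dev/null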
That’s interesting, so it seems to be DHT related. I recently merged some fixes for a latency bug in the DHT, but I’m surprised it would affect this. Could you please retry with 0.19-rc1 when it is released (planned for this week)?
The normal client does more work in a single DHT lookup than the accelerated client, so without more info it still seems possible that the libp2p resource manager is throttling things and slowing them down.
Could you try running the same tests on v0.18.1 with the resource manager disabled? That’s Swarm.ResourceMgr.Enabled = false. This will at least tell us whether it’s the RM or not; if it is, then we can work on tweaking the RM limits.
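Concretely, that should be something like the following, followed by a daemon restart:
# Disable the libp2p resource manager, then restart the daemon.
$ ipfs config --json Swarm.ResourceMgr.Enabled false
$ ipfs shutdown && ipfs daemon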
Disabling ResourceMgr appeared to partially alleviate the poor gateway performance for one of the test cases, but the rest are still unusable:
v0.18.1 with Swarm.ResourceMgr.Enabled = false
Unreachable/slow IPNS gateway request
$ time ipfs dht get -v -- /ipns/k51qzi5uqu5dmjnxaa3ed22c0a6xaoyw5d75uacizdpc4u9rmxnqialvwat8y9 | grep says | wc -l
10
real 0m0.371s
user 0m0.010s
sys 0m0.009s
# Extremely slow
$ time curl -so /dev/null http://localhost:8080/ipns/k51qzi5uqu5dmjnxaa3ed22c0a6xaoyw5d75uacizdpc4u9rmxnqialvwat8y9/
real 3m0.016s
user 0m0.013s
sys 0m0.006s
Unreachable/slow IPNS gateway request
$ time ipfs dht get -v -- /ipns/k51qzi5uqu5dm7gcgw5wxlf81aszjbihejohdv6pjewxwhtsjoimufognxl8fz | grep says | wc -l
10
real 0m0.324s
user 0m0.008s
sys 0m0.009s
# Extremely slow
$ time curl -so /dev/null http://localhost:8080/ipns/k51qzi5uqu5dm7gcgw5wxlf81aszjbihejohdv6pjewxwhtsjoimufognxl8fz/
real 3m0.122s
user 0m0.006s
sys 0m0.012s
Successful IPNS gateway request
$ time ipfs dht get -v -- /ipns/k51qzi5uqu5djwbl0zcd4g9onue26a8nq97c0m9wp6kir1gibuyjxpkqpoxwag | grep says | wc -l
0
real 0m0.124s
user 0m0.003s
sys 0m0.018s
# Noticeable improvement
$ time curl -so /dev/null http://localhost:8080/ipns/k51qzi5uqu5djwbl0zcd4g9onue26a8nq97c0m9wp6kir1gibuyjxpkqpoxwag/
real 0m0.216s
user 0m0.000s
sys 0m0.010s
I’m not really sure why one of the test sites can be retrieved while the others cannot. I’ve got plenty of additional examples if those would be helpful.
Just tried again with v0.20.0 and got the same results - IPNS performance is just incredibly slow. @Jorropo do you have any other suggestions that I can try?
edit - I should mention that almost every single IPNS gateway request completes in approximately 1m. Is this some kind of configurable timeout?
Testing again with v0.20.0 without --enable-pubsub-experiment and --enable-namesys-pubsub seemed to do the trick!
$ time ipfs resolve -r /ipns/k51qzi5uqu5djhw4rskl5h4ylujy8cwbhn3bcmgwxjzclrebis7uusb1gnfnmo/
/ipfs/QmXGSRcFfVhvwepRZFbg97AkTRo6kt23UDCiMYLcEzv8FC
real 0m0.704s
user 0m0.018s
sys 0m0.033s
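For anyone else who lands here: the only change on our side was restarting the daemon without the two PubSub flags. We also double-checked that the equivalent config options weren’t turning it back on (option names as we understand them; verify against your own config):
# Restart without IPNS-over-PubSub; neither flag is passed this time.
$ ipfs shutdown
$ ipfs daemon
# Confirm the related config toggles aren't set to true either.
$ ipfs config Ipns.UsePubsub
$ ipfs config Pubsub.Enabled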
I followed the instructions here for debugging and found many instances of hung goroutines related to these flags:
goroutine 586 [select, 89 minutes]:
github.com/ipfs/boxo/namesys/republisher.(*Republisher).Run(0xc001100870, {0x2c060b8, 0xc0012a8480})
github.com/ipfs/boxo@v0.8.1/namesys/republisher/repub.go:79 +0x12a
github.com/jbenet/goprocess.(*process).Go.func1()
github.com/jbenet/goprocess@v0.1.4/impl-mutex.go:134 +0x36
created by github.com/jbenet/goprocess.(*process).Go
github.com/jbenet/goprocess@v0.1.4/impl-mutex.go:133 +0x238
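For anyone wanting to collect a similar dump: a goroutine trace like the one above can be pulled from the daemon’s pprof endpoint on the API port (5001 by default), and ipfs diag profile bundles up the usual profiles the maintainers ask for:
# Full goroutine dump from the local daemon's debug endpoint.
$ curl -s 'http://localhost:5001/debug/pprof/goroutine?debug=2' > goroutines.txt
# Or grab the whole profiling bundle.
$ ipfs diag profile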
I’m not entirely sure what changed starting in v0.16.0 with regard to these features. I did find an open issue about deprecating namesys/pubsub in general, so it might be related. Thank you to everyone who assisted in troubleshooting this problem. I’m more than happy to provide additional findings if they can help anyone else.