Slow IPNS gateway performance with Kubo versions > 0.15.0

Hello IPFS community,

We’re testing Kubo upgrades in our dev environment. After upgrading to anything above v0.15.0, we’ve noticed that gateway requests for IPNS names now take anywhere from 3 to 4 minutes to resolve. Rolling back to v0.15.0 makes the problem go away.

According to the release notes for v0.16.0 and above, Kubo now ignores IPNS V1 signatures and only accepts records with valid V2 signatures:

Records that do not have a valid IPNS V2 signature, or exceed the max size
limit, will no longer pass verification, and will be ignored by Kubo when
resolving /ipns/{libp2p-key} content paths.

Could this be related to the slow performance we’re seeing with newer versions of Kubo?

The slow behavior can be recreated by running Kubo > 0.15.0 with the following (a quick check that separates name resolution from content retrieval is sketched after the timings below):

$ ipfs config --json DNS.Resolvers '{"eth.": "https://dns.eth.limo/dns-query"}'


# Slow IPNS example: /ipns/k51qzi5uqu5dmjnxaa3ed22c0a6xaoyw5d75uacizdpc4u9rmxnqialvwat8y9/
$ time curl http://localhost:8080/ipns/simple.eth/ -s -o /dev/null

real	3m0.025s
user	0m0.021s
sys	0m0.010s

# Working IPFS example: /ipfs/bafybeighxhsavoanjqkqvnnpbkvoweurybjt7gauunbg37ueahcbze5ise/
$ time curl http://localhost:8080/ipns/vitalik.eth/ -s -o /dev/null

real	0m0.223s
user	0m0.007s
sys	0m0.006s
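A rough way to separate name resolution from content retrieval on the same node (just a sketch; the CID is the one from the /ipfs/ comment above):

# Time the IPNS/DNSLink resolution on its own, bypassing the gateway
$ time ipfs resolve -r /ipns/simple.eth

# Then fetch an already-known CID via the gateway to confirm content retrieval itself is fast
$ time curl -s -o /dev/null http://localhost:8080/ipfs/bafybeighxhsavoanjqkqvnnpbkvoweurybjt7gauunbg37ueahcbze5ise/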

The node on which these tests were performed is publicly accessible and works fine when downgrading to Kubo 0.15.0.

Has anyone else noticed this, or does anyone have experience troubleshooting similar issues? Are there any debugging utilities or logs that might shed some light on what’s responsible for the slowness? A large percentage of ENS dWebsites use IPNS now, so obviously we don’t want to deploy an upgrade that will negatively impact user experience.
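So far the only debugging knobs we’re aware of are the standard go-log controls (a sketch; ipfs log ls lists the available subsystem names, and namesys / dht are just our guesses about which ones matter here):

# List logging subsystems, then raise verbosity on the ones that look relevant
$ ipfs log ls
$ ipfs log level namesys debug
$ ipfs log level dht debug

# Or restart the daemon with a global debug level via the go-log env var
$ GOLOG_LOG_LEVEL=debug ipfs daemon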

Any advice is much appreciated! Thank you.

Wow, great post. I wish I had a direct answer for you. I do have a question, though: are you using IPNS PubSub? And if not, does enabling it improve your situation at all?

Hi @Discordian,

Yes, we do have IPNS PubSub enabled. Here is a snippet from our systemd unit file:

ExecStart=/usr/bin/ipfs daemon --enable-gc --migrate=true --enable-pubsub-experiment --enable-namesys-pubsub

I have not noticed any improvement with these flags enabled.
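For completeness, if I’m reading the config docs correctly, the same behavior can also be toggled via config keys instead of daemon flags, which is what we’d eventually prefer for the unit file:

# Config-based equivalents of the daemon flags above (restart the daemon afterwards)
$ ipfs config --json Pubsub.Enabled true
$ ipfs config --json Ipns.UsePubsub true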

Wow, that’s surprising to me; I found resolution times after the first lookup to be much faster with that enabled.

My next idea (if you have the resources for it; IIRC this increases your memory requirements) is to see how the Accelerated DHT Client fares for you, as I found it to be critical for my VPS node.

We have been using the Accelerated DHT client too!

/usr/bin/ipfs config --json Experimental.AcceleratedDHTClient true
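In case it’s useful, here is how we’ve been sanity-checking that the accelerated client is actually active after a restart (treat this as a rough check; the stats output differs between the standard and accelerated clients):

# Confirm the setting, then inspect the DHT client's view of the network
$ ipfs config Experimental.AcceleratedDHTClient
$ ipfs stats dht wan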

Oh my! Perhaps @Jorropo has an idea (or maybe knows who might?). Very sorry for the inconvenience here.

Thank you for looking into this. It’s been quite a headache for us. Hopefully we can get to the root cause and proceed with our upgrade.

Absolutely, I will raise this thread everywhere I can to try to get you answers :pray:

Here is a bug I reported back in September that may (or may not) be related. The fact that it happened with the same version and has to do with discovery when using the accelerated DHT is what makes me think they could be related.

This should not be impacted by the signature change; it’s the same bytes under the hood, except we use different fields from the protobuf.
I don’t know without investigating, and I don’t really have time for this right now, sorry. One thing I would check is whether the ipfs dht get command also sees a slowdown between 0.15 and 0.15+ (the point is to see whether the slowdown is in the DHT or IPNS codepaths).
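Something like this on both versions (a rough sketch; the key is the one from the earlier curl test) would show which layer regressed:

# DHT codepath only: fetch the raw IPNS record from the DHT
$ time ipfs dht get -v -- /ipns/k51qzi5uqu5dmjnxaa3ed22c0a6xaoyw5d75uacizdpc4u9rmxnqialvwat8y9

# Full IPNS codepath: resolve the same name, skipping the local cache
$ time ipfs name resolve --nocache /ipns/k51qzi5uqu5dmjnxaa3ed22c0a6xaoyw5d75uacizdpc4u9rmxnqialvwat8y9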

Here are some comparisons between each version, with and without the Accelerated DHT client enabled. I’m happy to provide additional diagnostics.

v0.18.1

AcceleratedDHT: false

$ time ipfs dht get -v -- /ipns/k51qzi5uqu5dmjnxaa3ed22c0a6xaoyw5d75uacizdpc4u9rmxnqialvwat8y9 | grep says | wc -l
27

real	0m17.181s
user	0m0.009s
sys	    0m0.039s

$ time ipfs dht get -v -- /ipns/k51qzi5uqu5djwbl0zcd4g9onue26a8nq97c0m9wp6kir1gibuyjxpkqpoxwag | grep says | wc -l
0

real	0m0.429s
user	0m0.018s
sys	    0m0.022s

AcceleratedDHT: true

$ time ipfs dht get -v -- /ipns/k51qzi5uqu5dmjnxaa3ed22c0a6xaoyw5d75uacizdpc4u9rmxnqialvwat8y9 | grep says | wc -l
10

real	0m0.300s
user	0m0.012s
sys	    0m0.007s

$ time ipfs dht get -v -- /ipns/k51qzi5uqu5djwbl0zcd4g9onue26a8nq97c0m9wp6kir1gibuyjxpkqpoxwag | grep says | wc -l
0

real	0m0.085s
user	0m0.008s
sys	    0m0.009s

v0.15.0

AcceleratedDHT: true

$ time ipfs dht get -v -- /ipns/k51qzi5uqu5dmjnxaa3ed22c0a6xaoyw5d75uacizdpc4u9rmxnqialvwat8y9 | grep says | wc -l
10

real	0m0.581s
user	0m0.008s
sys	    0m0.008s

$ time ipfs dht get -v -- /ipns/k51qzi5uqu5djwbl0zcd4g9onue26a8nq97c0m9wp6kir1gibuyjxpkqpoxwag | grep says | wc -l
0

real	0m0.133s
user	0m0.004s
sys	    0m0.014s

AcceleratedDHT: false

$ time ipfs dht get -v -- /ipns/k51qzi5uqu5dmjnxaa3ed22c0a6xaoyw5d75uacizdpc4u9rmxnqialvwat8y9 | grep says | wc -l
88

real	0m20.587s
user	0m0.001s
sys	    0m0.080s

$ time ipfs dht get -v -- /ipns/k51qzi5uqu5djwbl0zcd4g9onue26a8nq97c0m9wp6kir1gibuyjxpkqpoxwag | grep says | wc -l
0

real	0m0.198s
user	0m0.011s
sys	    0m0.035s

I’m not sure whether the full output from ipfs dht get is useful, so I’ve omitted it for now. Please let me know and I’ll include it if necessary.

Are there any other diagnostics that could aid in troubleshooting this?

That’s interesting, so it seems to be DHT related. I recently merged some fixes for a latency bug in the DHT, but I’m surprised it would affect this. Could you please retry with 0.19-rc1 when it is released (planned for this week)?

Absolutely! I’ll keep an eye on the releases this week and report back with my findings. Thank you.

Can this be a resource manager issue?

Maybe, though we would expect the inverse behaviour (the accelerated client slowing down and the normal client working more or less as usual).

The normal client does more work in a single DHT lookup than the accelerated client, so with no more info it still seems possible that the libp2p resource manager is throttling things and slowing them down.

Could you try running the same tests on v0.18.1 with the resource manager disabled? That’s Swarm.ResourceMgr.Enabled = false. This will at least tell us whether it’s the RM or not; if it is, then we can work on tweaking the RM limits.
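Something like this should do it (a sketch; the service name in the restart step is assumed from your systemd setup):

# Disable the libp2p resource manager, then restart the daemon so it takes effect
$ ipfs config --json Swarm.ResourceMgr.Enabled false
$ sudo systemctl restart ipfs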

Disabling ResourceMgr appeared to partially alleviate the poor gateway performance for one of the test cases, but the rest are still unusable:

v0.18.1
Swarm.ResourceMgr.Enabled = false

Unreachable/slow IPNS gateway request

$ time ipfs dht get -v -- /ipns/k51qzi5uqu5dmjnxaa3ed22c0a6xaoyw5d75uacizdpc4u9rmxnqialvwat8y9 | grep says | wc -l
10

real	0m0.371s
user	0m0.010s
sys	    0m0.009s

# Extremely slow
$ time curl -so /dev/null http://localhost:8080/ipns/k51qzi5uqu5dmjnxaa3ed22c0a6xaoyw5d75uacizdpc4u9rmxnqialvwat8y9/

real	3m0.016s
user	0m0.013s
sys	    0m0.006s

Unreachable/slow IPNS gateway request

$ time ipfs dht get -v -- /ipns/k51qzi5uqu5dm7gcgw5wxlf81aszjbihejohdv6pjewxwhtsjoimufognxl8fz | grep says | wc -l
10

real	0m0.324s
user	0m0.008s
sys	    0m0.009s

# Extremely slow
$ time curl -so /dev/null http://localhost:8080/ipns/k51qzi5uqu5dm7gcgw5wxlf81aszjbihejohdv6pjewxwhtsjoimufognxl8fz/

real	3m0.122s
user	0m0.006s
sys	    0m0.012s

Successful IPNS gateway request

$ time ipfs dht get -v -- /ipns/k51qzi5uqu5djwbl0zcd4g9onue26a8nq97c0m9wp6kir1gibuyjxpkqpoxwag | grep says | wc -l
0

real	0m0.124s
user	0m0.003s
sys	    0m0.018s

# Noticeable improvement
$ time curl -so /dev/null http://localhost:8080/ipns/k51qzi5uqu5djwbl0zcd4g9onue26a8nq97c0m9wp6kir1gibuyjxpkqpoxwag/

real	0m0.216s
user	0m0.000s
sys	    0m0.010s

I’m not really sure why one of the test sites can be retrieved but the others cannot. I’ve got plenty of additional examples if those would be helpful.

Just tried again with v0.20.0 and got the same results: IPNS performance is still incredibly slow. @Jorropo, do you have any other suggestions I can try?

Edit: I should mention that almost every single IPNS gateway request completes in approximately 1m. Is this some kind of configurable timeout?
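To test the timeout theory, the check I’m planning next is to vary the DHT resolution timeout on the CLI side and see whether the wall time tracks it (a sketch; whether the gateway uses the same knob is an open question):

# Does the delay track the DHT resolution timeout? CLI-side check only
$ time ipfs name resolve --nocache --dht-timeout=30s /ipns/k51qzi5uqu5dmjnxaa3ed22c0a6xaoyw5d75uacizdpc4u9rmxnqialvwat8y9
$ time ipfs name resolve --nocache --dht-timeout=2m /ipns/k51qzi5uqu5dmjnxaa3ed22c0a6xaoyw5d75uacizdpc4u9rmxnqialvwat8y9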

:pray: @eth-limo,
I’m still not sure what’s causing these issues :face_with_head_bandage: so it’s back to testing multiple gateways, services, and versions.

IPNS published from web3.storage/w3name service

IPNS published from https://dwebservices.xyz/

I’ll update this with results from local/browser Helia nodes and Fleek/other services soon…

Testing again with v0.20.0, but without --enable-pubsub-experiment and --enable-namesys-pubsub, seemed to do the trick!

$ time ipfs resolve -r /ipns/k51qzi5uqu5djhw4rskl5h4ylujy8cwbhn3bcmgwxjzclrebis7uusb1gnfnmo/
/ipfs/QmXGSRcFfVhvwepRZFbg97AkTRo6kt23UDCiMYLcEzv8FC

real	0m0.704s
user	0m0.018s
sys	   0m0.033s
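For reference, the daemon invocation we’re validating now is just the unit file line from earlier in the thread with those two flags dropped:

# Pubsub flags removed while we confirm the fix
ExecStart=/usr/bin/ipfs daemon --enable-gc --migrate=true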

I followed the instructions here for debugging and found many instances of hung goroutines related to these flags:

goroutine 1435999 [select, 1542 minutes]:
github.com/libp2p/go-yamux/v4.(*Stream).Read(0xc00baae2a0, {0xc0093a0630, 0x1, 0x1})
        github.com/libp2p/go-yamux/v4@v4.0.0/stream.go:111 +0x1a9
github.com/libp2p/go-libp2p/p2p/muxer/yamux.(*stream).Read(0xe7707c?, {0xc0093a0630?, 0x100c00bc1d2c0?, 0x203000?})
        github.com/libp2p/go-libp2p@v0.27.3/p2p/muxer/yamux/stream.go:17 +0x1e
github.com/libp2p/go-libp2p/p2p/net/swarm.(*Stream).Read(0xc00976a700, {0xc0093a0630?, 0x80?, 0x7fad1e107430?})
        github.com/libp2p/go-libp2p@v0.27.3/p2p/net/swarm/swarm_stream.go:55 +0x33
github.com/multiformats/go-multistream.(*lazyClientConn[...]).Read(0xc0047f5ea0?, {0xc0093a0630?, 0x1?, 0x1?})
        github.com/multiformats/go-multistream@v0.4.1/lazyClient.go:68 +0xb8
github.com/libp2p/go-libp2p/p2p/host/basic.(*streamWrapper).Read(0xc0078c1530?, {0xc0093a0630?, 0x3c500?, 0xc710302babf?})
        github.com/libp2p/go-libp2p@v0.27.3/p2p/host/basic/basic_host.go:1156 +0x27
github.com/libp2p/go-libp2p-pubsub.(*PubSub).handlePeerDead(0x0?, {0x2c0f6c0, 0xc0084fd640})
        github.com/libp2p/go-libp2p-pubsub@v0.9.3/comm.go:147 +0x86
created by github.com/libp2p/go-libp2p-pubsub.(*PubSub).handleNewPeer
        github.com/libp2p/go-libp2p-pubsub@v0.9.3/comm.go:128 +0x398
goroutine 586 [select, 89 minutes]:
github.com/ipfs/boxo/namesys/republisher.(*Republisher).Run(0xc001100870, {0x2c060b8, 0xc0012a8480})
        github.com/ipfs/boxo@v0.8.1/namesys/republisher/repub.go:79 +0x12a
github.com/jbenet/goprocess.(*process).Go.func1()
        github.com/jbenet/goprocess@v0.1.4/impl-mutex.go:134 +0x36
created by github.com/jbenet/goprocess.(*process).Go
        github.com/jbenet/goprocess@v0.1.4/impl-mutex.go:133 +0x238
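For anyone who wants to pull the same data: a dump like the one above can be collected from the daemon’s debug endpoints (a sketch, assuming the API is listening on the default 127.0.0.1:5001; ipfs diag profile also bundles goroutine stacks into a zip):

# Full goroutine stack dump from the daemon's pprof endpoint
$ curl -s -o goroutines.txt 'http://127.0.0.1:5001/debug/pprof/goroutine?debug=2'

# Or collect a bundle of profiles (includes goroutine stacks) in one command
$ ipfs diag profile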

I’m not entirely sure what changed starting in v0.16.0 with regard to these features. I did find an open issue regarding the deprecation of namesys/pubsub in general, so it might be related. Thank you to everyone who assisted in troubleshooting this problem. I’m more than happy to share additional findings if they can help anyone else.
