Debugging slow content discovery

Can you describe more about your setup where you’re seeing 10-20 minutes for node B to be able to resolve the hash for the content added by node A? The response times I’ve seen have typically been very fast for /ipfs/ paths – even ones where I know there’s only one node that can provide the content.

It was an audio file of few MB.
Node A was behind NAT, but with 4001 open, node B was behind NAT, but without 4001 open.
These are the test URLs: http://plesko.si/test/

The problem was just when it was added, i.e. the first 10-20 min. Afterwards it was quick.

This wouldn’t be much of a problem for some super static content, but it would for when you just want to share a file right now or dynamic content.

If I create a random file and add it to a node behind NAT with UPnP enabled, I can resolve it as fast as I can navigate to it in a browser from a non-default gateway (less than 500ms using the example below). I think the only difference between this scenario and yours is that the gateway is likely not behind a NAT with a closed port; but I’m not sure why that would slow down your requests from node B, though.

dd if=/dev/urandom of=random bs=1m count=10
hash=$(ipfs add --pin=false -q random)
time curl https://cloudflare-ipfs.com/ipfs/$hash > /dev/null # this happens very quickly on my NAT'd laptop
rm random

OK, I’ll test on various setups.

You have a typo, it should be bs=1M not bs=1m.
At least for me it doesn’t work with a lowercase.

I ran your script (with bs=1M) on node A (behind NAT, with open port 4001, not UPnP).
So it generates a 10 MB file, puts it on ipfs, downloads it from cloudflare-ipfs.

1st trial: 1 min 42 seconds
2nd trial: 0 min 55 seconds
3rd trial: 1 min 42 seconds
4th trial: 1 min 42 seconds
5th trial: 1 min 42 seconds

Seems to need more than your case, but not as much as my past cases.
The internet speed is 20/20 mbit (real mbits), that would mean best case real scenario let’s say 10-20 seconds, so 55 seconds is alright (no idea why it came once?), 1 min 42 sec is a bit high, but still conditionally alright.

Node B (behind NAT, with closed ports) is far more interesting and I think I’ll do further tests.
I did some and insofar this seems to be a problem, even in cases when it shouldn’t be, i.e. node A/node B should form a direct connection, am I correct?

For example, I did a modified experiment.
On Node A I’ve generated the random file, put it on ipfs but just printed out the hash instead of downloading the file off cloudflare.
Then I went on Node B, did ipfs get [hash] and waited.
It needed 13 minutes to download.
After it finished on Node B I did on Node A time curl from cloudflare and it needed for the same file 1 min 42 s.

Then I’ve tried a second experiment.
I first put on Node A on ipfs a random file, then curl from cloudlfare (again 1 min 42 s), but then I went on Node B and did a ipfs get [hash]. It needed 14 seconds. Fascinating.

I have also Node C (not behind NAT, but with closed ports) that I’ll test, but sadly I don’t have a machine which would have ports open, i.e. that would be like Node A. It could be that Node A/B is problematic.

1 Like

OK, I’ve tried your script with Node C which is on a 1/1 gbit connection with own IP (university computer), but behind firewall so no ports open.

1st trial: 1 min 19 s
2nd trial: 1 min 41 s
3rd trial: 1 min 41 s

Did a modified experiment, where I generated the random file, did ipfs add on Node C, but haven’t requested from cloudflare the file. Instead on Node A, I did ipfs get [hash].
Again the weird downloading, 1 M fast, then wait, then I’m at 5 M fast and wait, then I’m at 6 M, wait again for a long time, then I’m at 8 M for a while. I guess it downloads chunks.
It needed 35 min 21 s in total.

I’m starting to suspect that Node A hasn’t got the ports properly open, but I did nmap [ip of Node A] on Node C for Node A and 4001 is open. If I do on Node C: telnet [ip of Node A] 4001 I get

/multistream/1.0.0

So I guess it works alright?

From what I see insofar is that if I force cloudflare to have my file, by downloading it from it after I put it on IPFS, everything works rather fast. If I don’t do that, then it becomes super slow.

What I suspect is that it’s a problem with DHT searching where the [hash] is. Everyone probably has the Cloudflare node for querying where the [hash] is, but if Cloudflare doesn’t have the [hash], the search becomes way slower. That’s why I suspect I get super fast download for 1 M then wait, then super fast 1-2 M again. It’s because when it locates where a chunk is, then it becomes fast to download it, but the search itself is slow as hell.

From what I gather, if the above is correct, it doesn’t even have the heuristic idea, that if I want the file, which has N chunks and I got one chunk from peer X (I suspect that’s Node A), then most likely peer X has also the other chunks. I don’t know the details how it works (just started with IPFS) and I also don’t know how to test this easily; I don’t see any “ipfs get” verbose mode, where I could see from which peers is downloading.

I re-run now your script on Node C. Now with pure “ipfs daemon”, fresh start, waited 5 min before starting the test scripts. The results were on average better, but there was an interesting result, when I’ve used ipfs.io gateway the first time.

If I download from cloudflare (3 trials): 8,5 s; 47,8 s; 6,2 s
If I download from ipfs.io (3 trials): 2m 6s; 3,0 s; 2,0 s

That 2 m 6s I think is because it was the first time for this machine to use “ipfs.io” and the DHT routing had to discovered? No idea. If that’s so, an another reason why URI hints to peers might be a good idea?

I did the same on Node A. Restarted the ipfs daemon the same way, started your script.
cloudflare: 1m 41s, 1m 41s, 1m 41s
ipfs.io: 29m 1s, 16m 14s, 30m 1s

After I finished, I’ve started the script on Node C, without restarting ipfs since the last experiment:
ipfs.io: [error in HTTP2 framing layer after 9 minutes]
cloudflare: 1m 40s, 1m 40s
ipfs.io: 2,8 s, 1,9 s, [manually cancelled after 10 min]

I started suspecting why cloudflare’s results were so consistent, so I’ve removed the > /dev/null.
Well, after 1m 40s it times out, that’s why ^.^

I’ve tried a new experiment. At the approx. same time, I’ve generated a random file on Node A and a different random file on Node C. I computed of both the SHA256 and put on IPFS.

Then on Node A I started downloading with “ipfs get [hash]” the file from Node C and the opposite on Node C, where I started downloading the file from Node A. Both without first calling ipfs.io/[hash] or cloudflair.

I noticed on both cases the weird way of downloading, where it waited a lot between MB, but when it changed by 1 MB or even by 4 MB it went quick (e.g. was at 2 MB for few minutes then in matter of seconds got to 5 MB, then again a long pause). Also in both cases it needed several minutes to start downloading.

On Node C it downloaded the file from Node A in 8 m 17 s, checksum correct.
On Node A it downloaded the file from Node C in 8 m 16 s, checksum correct.

I did a 2nd trial right after it finished.
On Node C: 13 m 28 s
On Node A: 87 m 3 s

macOS’ dd uses a lower-case “m” while linux’s needs an upper-case “M”. I’m currently on macOS which is why it was lower-case.

How many peers is the node that uploads the file connected to? It’s weird to me that your times to download the file from the originating node through a gateway are still sometimes 1+ minute.

ipfs swarm peers | wc -l

Oh I didn’t know about that macOS/Linux difference, sorry.

I’ve checked the number of peers and Node A had 49 peers, Node C had 637 peers.

I checked the config of Node A and for some reason the routing was set to “dhtclient”.

I was playing a few days ago with IPFS on Node A and did invoke the --routing=dhtclient a few times and applied the profile “lowpower” at some point. Later on, I did apply the profile “default-networking”, which I thought it would set the default values back again. It seems it isn’t the case, read it wrong and looks like that there is no “set to default” option.

I deleted the ipfs config and ran again ipfs init. Now Node A has 307 peers (and raising).

But now it seems that Node A is totally underpowered to handle IPFS without “lowpower” profile, before it seemed like it works nice, because I didn’t know the “lowpower” profile was still applied. Now it’s constantly at 100% CPU. On “lowpower” it isn’t the case, it still consumes CPU but in the range of 20%. Weird.
Node A does good/full speeds with torrents, which I thought would be of similar CPU intensity.

I guess I need to try better test nodes. :blush:

Sorry for all of this.

Thanks for sharing your results, even if the config customizations are responsible. I wouldn’t be surprised if that explains the performance you’re seeing.

In either case, running IPFS from low-powered nodes will hopefully become more common; so proposals like the one in your first post seem useful for thinking about the experience for users on those nodes can be improved.

Yes, well I don’t have very good computers, but if you want that it works also on shittier devices or poorer regions of the world, I think it needs to eat up less resources. The “lowpower” profile is I think good enough for such, but the whole scheme is not then adapted to it, even thou with the simple hack proposed in the 1st post it would be suitable for it as well.

I did it now differently to prove this. Node A is still in “low power”, Node C is default settings.

I tried first on Node C ipfs ping [peer ID of Node A], it timed out.
I went on Node C and did ipfs dht findpeer [peer ID of Node A] and quite quickly, it found the associated IPs, local and exterior ones.
I did again on Node C ipfs ping [peer ID of Node A] and I got an average of 10 ms.

I’ve generated a new random file on Node A and put it on ipfs, waited a bit (no idea how much, let’s say 10 min).
I ran again on Node C the ping and got a weird ping result:

Pong received: time=9353.73 ms
Pong received: time=3177.23 ms
Pong received: time=2466.85 ms
Pong received: time=2485.99 ms
Pong received: time=2424.44 ms
Pong received: time=1739.68 ms
Pong received: time=1299.30 ms
Pong received: time=1078.96 ms
Pong received: time=934.47 ms
Pong received: time=616.13 ms
Average latency: 2557.68ms

Progressively shorter. Did it again.

Pong received: time=40.65 ms
Pong received: time=40.64 ms
Pong received: time=83.18 ms
Pong received: time=76.54 ms
Pong received: time=14.61 ms
Pong received: time=21.42 ms
Pong received: time=10.15 ms
Pong received: time=7.34 ms
Pong received: time=10.06 ms
Pong received: time=9.29 ms
Average latency: 31.39ms

Weird. Like it forgot the peer, found it again, changed the routing while pinging and then did it directly again. No idea?!

Anyway next pings were normal, so I quickly did the ipfs get [hash of random file on Node A] and it needed just 19 s to download, compared to 8-13 minutes before this is stellar.

I pinged then again and it was OK, but then after 10 min I tried again and it was again first 137 ms, then 26 ms, then again in 10 ms approx. Why is this?!

Main point though, if I force the IPFS to connect to Node A, in the before mentioned way, prior to Node C downloading the file, it works approx. 30 times faster, than what it does without “hinting where it is”. So I still think URI hinting with [peer IDs] and maybe not IPs would be stellar. :slight_smile:

Also, is there a traceroute tool for IPFS?

An another thing. Node B and Node C are behind NAT/firewall.
They can’t see each other directly, but they also can’t ping each other through IPFS. I thought this is possible, that automatically a 3rd node with open ports serves as a proxy.
This isn’t the case it seems?

Is there a way to verbosely see what’s happening during download with ipfs get [hash]?

What are the specs for Node A? I used to run a full IPFS node on a $200 laptop with a very weak CPU (but 8GB of RAM) without any issues. From looking at what the lowerpower profile changes, it comes with this warning

Reduces daemon overhead on the system. May affect node
functionality - performance of content discovery and data
fetching may be degraded.

Even outside of the lowpower profile, I’m pretty sure some of the go-ipfs devs are working on reducing resource consumption so that less powerful devices can run it without as many compromises.

Outside of network issues (e.g., link saturation, packet loss) or one of the machines being too busy to respond in a timely manner, I’m not sure what IPFS would be doing to slow it down. What do you get if you just do a regular ping of node A’s external IP?

I agree. This is an interesting idea, but ideally shouldn’t be needed.

Relays are already implemented and the functionality is enabled by default, but if this post is still accurate, then automatic relay node discovery isn’t implemented yet. There’s nothing stopping you from doing it manually, though.

Yes, see this post as an example. You’ll have to filter the debug output only to events you care about because there’s a lot of it.

RAM wasn’t really an issue.
Let’s see,
Node A is a Raspberry Pi 1B with an exterior HDD
Node B is a Lenovo T410, with upgraded RAM to 6 GB.
Node C is good/new.

I’m thinking of trying to make a super basic client to better understand how IPFS works and why it needs this many resources/why it’s slow.

Similar issue in: