Extreme Slowness in IPFS Node Sync and Metadata Access on GCP

Hi everyone,
In our infrastructure on Google Cloud Platform (GCP), we run two IPFS nodes, and we have noticed extreme slowness both in the distribution of uploaded files and in accessing the metadata stored on IPFS.

Current Configuration

IPFS Nodes:

  • Node 1: Accelerated DHT disabled, public IP, TCP port 4001 open.
  • Node 2: Accelerated DHT enabled, public IP, TCP port 4001 open.

Hardware:

  • VM Type: e2-standard-2 (2 vCPU, 8 GB RAM)
  • Disks: 50 GB for the operating system, 200 GB additional for IPFS data
  • Operating System: Ubuntu 22.04.2 LTS

Software Version:

  • Current Version: IPFS 0.29.0
  • Previous Version: IPFS 0.20.0 (no slowness issues while it was running)

Issues Encountered

File Distribution:

  • The distribution of files takes hundreds of hours, as evidenced by the logs.
  • Specific log from IPFS1:

2024-08-02 19:30:23 ⚠️ Your system is struggling to keep up with DHT reprovides!
2024-08-02 19:30:23 This means your content could partially or completely inaccessible on the network.
2024-08-02 19:30:23 We observed that you recently provided 128 keys at an average rate of 1m20.099219968s per key.
2024-08-02 19:30:23
2024-08-02 19:30:23 💾 Your total CID count is ~29683 which would total at 660h26m25.146310144s reprovide process.
2024-08-02 19:30:23
2024-08-02 19:30:23 ⏰ The total provide time needs to stay under your reprovide interval (22h0m0s) to prevent falling behind!
2024-08-02 19:30:23
2024-08-02 19:30:23 💡 Consider enabling the Accelerated DHT to enhance your reprovide throughput.
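For reference, the arithmetic behind that estimate (a rough back-of-the-envelope check, not taken from the logs):

# ~29683 CIDs at ~80 s per key, against a 22 h reprovide interval
echo '29683 * 80.1 / 3600' | bc -l    # ≈ 660 hours, roughly 30x the 22 h window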

Metadata Access:

  • We also experience significant delays in accessing the metadata stored on IPFS.

Firewall Configuration

  • Open Ports: TCP 4001 for both nodes.

Symptoms

  • Even with adequate CPU, memory, and storage resources, the slowness persists.
  • We have verified that the nodes are correctly advertising their data in the DHT.

Resolution Attempts

Enabling Accelerated DHT:

  • On Node 2, the Accelerated DHT was enabled, but it did not resolve the issue.
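For reference, the change we applied on Node 2 was roughly the following (a sketch; it assumes the standard kubo config flag and a systemd-managed daemon named "ipfs"):

# enable the Accelerated DHT client
ipfs config --json Routing.AcceleratedDHTClient true
# restart the daemon so the change takes effect (service name is an assumption)
sudo systemctl restart ipfs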

Resource Check:

  • Increased RAM on Node 2 to better handle the load with Accelerated DHT.

Port Configuration:

  • Confirmed that only TCP port 4001 is open, as recommended.
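How we verified the listener on the VM itself (a quick sketch; output details vary by system):

# confirm the daemon is listening on TCP 4001 (and, later, UDP 4001 for QUIC)
sudo ss -tulpn | grep 4001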

Consultation of Forums and Resources:

  • Consulted various online sources suggesting difficulties in finding appropriate peers as a possible cause.

Observations

We found that “the provider calls keep searching for longer for appropriate peers. Undiallable nodes are a big issue right now”. This suggests that there are many unreachable nodes causing significant delays.

Request for Support

We are unable to understand what is causing this extreme slowness, especially considering that this problem did not occur with the previous software version. Can anyone help us identify and resolve the issue?

Thank you

Have you used ChatGPT to write this?

Are the logs also generated?

Hi Hector, you’re correct; I’m not a developer myself, so I used ChatGPT to help structure and formulate the post for clarity. However, the logs and the technical details provided are all original and based on actual data from our system.
This is the entire warning in the IPFS1 log:

2024-08-02 19:30:23 :warning: Your system is struggling to keep up with DHT reprovides!
2024-08-02 19:30:23 This means your content could partially or completely inaccessible on the network.
2024-08-02 19:30:23 We observed that you recently provided 128 keys at an average rate of 1m20.099219968s per key.
2024-08-02 19:30:23
2024-08-02 19:30:23 :floppy_disk: Your total CID count is ~29683 which would total at 660h26m25.146310144s reprovide process.
2024-08-02 19:30:23
2024-08-02 19:30:23 :alarm_clock: The total provide time needs to stay under your reprovide interval (22h0m0s) to prevent falling behind!
2024-08-02 19:30:23
2024-08-02 19:30:23 :bulb: Consider enabling the Accelerated DHT to enhance your reprovide throughput. See:

2024-08-02 19:30:23 kubo/docs/config.md at master · ipfs/kubo · GitHub

do you have suggestions?

If you enabled AcceleratedDHT on node2, and that did not resolve the issue, what is the issue now? Is it still printing this message?

After enabling the Accelerated DHT on Node 2, we noticed some improvement, but we’re still encountering significant issues. Although Node 2 is performing slightly better, it continues to struggle. The log shows:

2024-08-02 18:17:47 :floppy_disk: Your total CID count is ~29665 which would total at 352h44m41.45360112s reprovide process.

Additionally, we continue to face difficulties accessing metadata. Sometimes, we can access the metadata, but often we receive a 504 Gateway Timeout. For example, using the CID Inspector (https://cid.ipfs.tech/?utm_source=bifrost&utm_medium=ipfsio&utm_campaign=error_pages#QmcxhZkj1sBTpmHDP9XK6wj7kFHTeB6Qgw3iwvvxrKhpsW), it appears that the NFT with the hash QmcxhZkj1sBTpmHDP9XK6wj7kFHTeB6Qgw3iwvvxrKhpsW works fine. However, when trying to access the metadata directly via the link https://dweb.link/ipfs/QmcxhZkj1sBTpmHDP9XK6wj7kFHTeB6Qgw3iwvvxrKhpsW we frequently encounter a 504 Gateway Timeout.
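For anyone who wants to reproduce this, the checks we run look roughly like this (a sketch; the ipfs commands are run on the node itself):

# does the public gateway respond? (frequently a 504 for us)
curl -sI https://dweb.link/ipfs/QmcxhZkj1sBTpmHDP9XK6wj7kFHTeB6Qgw3iwvvxrKhpsW | head -n 1

# is the block present locally on our node?
ipfs block stat QmcxhZkj1sBTpmHDP9XK6wj7kFHTeB6Qgw3iwvvxrKhpsW

# who is advertising it in the DHT?
ipfs routing findprovs QmcxhZkj1sBTpmHDP9XK6wj7kFHTeB6Qgw3iwvvxrKhpsW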

We are unable to understand what is causing this extreme slowness, especially since this issue did not occur with the previous version of the software.

The CID inspector only dissects a CID; it doesn’t test whether the content is retrievable. Use https://ipfs-check.on.fleek.co/ instead for debugging.

Perhaps your machine does not have enough bandwidth and/or is being used for other things… you should be able to use regular system administration monitoring tools and metrics to figure out what the bottleneck is, at least at a high level.

Hi Hector, we’ve checked and confirmed that the resources allocated to the VMs dedicated to this task are sufficient.
I ran the test twice with the IPFS Check tool. The first test produced the following results:
https://ipfs-check.on.fleek.co/?cid=QmeXrKrrKkZM3E1uCrjrfuqNgfgB77m5oMCSxYUTFthLhx&multiaddr=%2Fp2p%2F12D3KooWCUAtpadjtg6nbyRVN1QKKFCCUDvJc3xcoCDZVk5CrB12

However, when I attempted the test again a few days later, it failed and returned the following error: :warning: backend error: TypeError: Failed to fetch.

Could you please advise on what might be causing the “Could not find the multihash in the DHT” issue? Is there something specific we should be checking or adjusting in our configuration?

This means that you are not providing the hash in a timely manner.

You mentioned that the previous version did not suffer this problem. What about the latest (0.30.0-rc2 I think)?

Among the things that have changed between 0.20.0 and 0.30.0 are probably additional transports (e.g. QUIC?) and perhaps changes to the resource manager. You can check ipfs swarm resources to see if any resources are at >90% usage.

But it’s also worth checking whether your machine’s bandwidth usage is now maxed out, or whether something else is slowing down providing; it seems to be too slow, which might point to a constraint on connections to other peers or a constraint on bandwidth.
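A couple of quick checks along those lines (a sketch; sar assumes the sysstat package is installed, and output formats may vary by kubo version):

# resource manager usage vs. limits (look for anything near its limit)
ipfs swarm resources

# bandwidth currently used by the ipfs daemon itself
ipfs stats bw

# overall NIC throughput on the VM, one sample per second
sar -n DEV 1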


From a test I conducted, it seems that QUIC isn’t working on your node:

I would consider making sure that UDP port 4001 is open to make use of QUIC.

Additionally, update the addresses in the IPFS config Addresses.Swarm from /quic to /quic-v1.

For more information, see:

QUIC should help with both “cheaper” connectivity and far fewer round trips than TCP. This can have quite an impact when doing DHT provides to many peers.
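Putting those two suggestions together, something like this (a sketch: the firewall rule name, network, and systemd service name are placeholders; the address list mirrors current kubo defaults, without webtransport):

# open UDP 4001 on GCP (rule name and network are examples)
gcloud compute firewall-rules create allow-ipfs-quic \
  --network=default --direction=INGRESS \
  --allow=udp:4001 --source-ranges=0.0.0.0/0

# switch the swarm listen addresses to quic-v1
ipfs config --json Addresses.Swarm '[
  "/ip4/0.0.0.0/tcp/4001",
  "/ip4/0.0.0.0/udp/4001/quic-v1",
  "/ip6/::/tcp/4001",
  "/ip6/::/udp/4001/quic-v1"
]'

# restart the daemon so the new listeners take effect (service name is an assumption)
sudo systemctl restart ipfs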

Final note about kubo/0.30.0-rc2

I also see that you’re running kubo/0.30.0-rc2/. We’ve discovered a bug in this version. So I’d recommend downgrading to kubo/0.30.0-rc1/ or to kubo/0.29.0/.

Hi Daniel and Hector, thank you again for your suggestions.

We’ve updated the Swarm addresses to use /quic-v1, and confirmed that UDP port 4001 was already open. After enabling Accelerated DHT on Node 2, it stabilized at around 3000 peers without the RAM spikes we experienced earlier. Node 1 remains stable with around 2600 peers, and we’re waiting to test further before enabling Accelerated DHT on Node 1.

As for the Kubo version, we upgraded to v0.30.0-rc2 as recommended by Hector. For now, we’re avoiding too many changes at once and are monitoring the system’s performance with QUIC v1 enabled and Accelerated DHT running on Node 2. @hector I’d appreciate your thoughts on the bug Daniel mentioned with v0.30.0-rc2 before we consider any downgrades.

When I initially tested with IPFS Check, I was receiving the error “multihash not found in DHT.” However, now when I run the test, I consistently encounter the following error:

:warning: backend error: TypeError: Failed to fetch

Because of this, I’m unable to verify whether the multihash DHT issue has been resolved or not. Despite this, I’ve noticed improvements in accessing metadata overall.

Do you have any suggestions on how I can verify that everything is functioning correctly, given the recurring backend error?


We just released v0.30.0-rc3 last night. So feel free to bump it to that version rather than downgrading, to make sure you don’t hit any of those problems specific to rc2.

You can follow the release here: Release 0.30 · Issue #10436 · ipfs/kubo · GitHub. There’s a good chance that rc3 will be the final release candidate.

Can you please share the url for the check? (with the CID and multiaddr in the query params)

Thanks for the heads-up on the release.

Here is the link you requested with the CID and multiaddr included: IPFS Check

You are missing the Peer ID in the multiaddr of the check.

Make sure to add it as follows: /ip4/34.79.65.170/tcp/4001/p2p/PEER_ID
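For example, you can read the full multiaddrs, peer ID included, straight off the node (a sketch, run on the node itself):

# entries under "Addresses" are full multiaddrs that already end in /p2p/<peer-id>
ipfs id | grep 4001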

Also, thanks for the heads up. A good opportunity to improve the error for that case and make it less confusing.

Btw, if you run the check without a multiaddr, you will see that it can’t find any providers, proving that it’s having trouble announcing the provider record to the DHT.

Do you have any suggestions on how to resolve the issue with announcing the provider record to the DHT?

That’s exactly what the Accelerated DHT Client should help with.

This is what’s currently available in lieu of improvements to the DHT implementation (like batch provides and reprovider sweep).
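Once it’s enabled, you can check whether providing is actually keeping up; a sketch (assuming your kubo version exposes the provide stats command):

# reprovider statistics: keys provided and average provide duration
ipfs stats provide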

Hi Daniel, we have two IPFS nodes, and here are the URLs with the peer IDs configured for both nodes:

However, both tests continue to return the following error:

:warning: backend error: TypeError: Failed to fetch

I’m wondering if this is a localized issue or if there is a broader problem with the backend. The IPFS Check seems to rely on https://ipfs-check-backend.ipfs.io, and as IPFS.io is currently down, could this be part of the problem?

We’ve also installed the rc3 version on Node 2, but I’m not too convinced about its performance so far. Initially, the node started up well with 2600 peers, but it’s now dropped to below 200:

ipfs@ipfs-node-2:/$ ipfs swarm peers | wc -l
2609
ipfs@ipfs-node-2:/$ ipfs swarm peers | wc -l
243
ipfs@ipfs-node-2:/$ ipfs swarm peers | wc -l
190

I’m keeping an eye on it for now. Any insights on these fluctuations or suggestions regarding the fetch error would be appreciated.
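In the meantime, this is roughly how I’m watching the peer count (a trivial sketch):

# sample the connected peer count every 60 seconds
watch -n 60 'ipfs swarm peers | wc -l'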

I’ve identified that the previous issue with the backend error: TypeError: Failed to fetch was due to my internet service provider, so that problem is now resolved. However, after running some tests on both nodes, I’m encountering the following issues:

Node 1 (v0.30, no Accelerated DHT)

  • :x: Could not connect to multiaddr: failed to dial
  • :white_check_mark: Found multiaddr with 10 DHT peers
  • :x: Could not find the multihash in the DHT
  • :x: The peer did not quickly respond if it had the CID

Node 2 (v0.30-rc3, Accelerated DHT enabled)

  • :x: Could not connect to multiaddr: failed to dial
  • :x: Could not find the given multiaddr in the DHT (found a different one instead)
  • :white_check_mark: Found multihash advertised in the DHT
  • :x: The peer did not quickly respond if it had the CID

It seems like the connection to the multiaddr on both nodes is failing with a dial backoff error, and for Node 1, I’m unable to retrieve the multihash from the DHT. Node 2, which is running with Accelerated DHT, can find the multihash, but is still facing connection issues.

How can we resolve the dial backoff error and the multihash retrieval issue on both nodes?

First off, I should point out that you can test a CID and a PeerID without the full multiaddr.


Based on the results it appears that Peer 1 is undialable and does not advertise the multihash in the DHT. I would double check your networking configuration and firewall to fix the dialability/connection problem.
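A couple of quick checks for that (a sketch; NODE1_PUBLIC_IP is a placeholder for Node 1’s public address):

# from outside GCP: can the TCP listener be reached at all?
nc -vz NODE1_PUBLIC_IP 4001

# on the node itself: which addresses is it announcing to the network?
ipfs id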

As for Peer 2, it seems to be functioning fine. It advertises correctly and I can connect with both QUIC and TCP (I tried both manually).
