Hi everyone,
In our infrastructure on Google Cloud Platform (GCP), we run two IPFS nodes, and we have noticed extreme slowness both in the distribution of uploaded files and in accessing the metadata stored on IPFS.
Current Configuration
IPFS Nodes:
Node 1: Accelerated DHT disabled, public IP, TCP port 4001 open.
Node 2: Accelerated DHT enabled, public IP, TCP port 4001 open.
Hardware:
VM Type: e2-standard-2 (2 vCPU, 8 GB RAM)
Disks: 50 GB for the operating system, 200 GB additional for IPFS data
Operating System: Ubuntu 22.04.2 LTS
Software Version:
Current Version: IPFS 0.29.0
Previous Version: IPFS 0.20.0 (no slowness issues while it was running)
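For reference, a minimal sketch of how the Accelerated DHT setting listed above maps to Kubo configuration (assuming the standard Kubo CLI; in recent releases the key lives under Routing.AcceleratedDHTClient):
ipfs config Routing.AcceleratedDHTClient              # check the current value
ipfs config --json Routing.AcceleratedDHTClient true  # enable it (Node 2's setup), then restart the daemon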
Issues Encountered
File Distribution:
The distribution of files takes hundreds of hours, as evidenced by the logs.
Specific log from IPFS1:
2024-08-02 19:30:23 ⚠️ Your system is struggling to keep up with DHT reprovides!
2024-08-02 19:30:23 This means your content could partially or completely inaccessible on the network.
2024-08-02 19:30:23 We observed that you recently provided 128 keys at an average rate of 1m20.099219968s per key.
2024-08-02 19:30:23
2024-08-02 19:30:23 💾 Your total CID count is ~29683 which would total at 660h26m25.146310144s reprovide process.
2024-08-02 19:30:23
2024-08-02 19:30:23 ⏰ The total provide time needs to stay under your reprovide interval (22h0m0s) to prevent falling behind!
2024-08-02 19:30:23
2024-08-02 19:30:23 💡 Consider enabling the Accelerated DHT to enhance your reprovide throughput.
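To relate the numbers in that warning to the node's own settings, these are the kinds of commands we are using to inspect the reprovider (a sketch; as far as I know, ipfs stats provide is only available once the accelerated DHT client is enabled):
ipfs config Reprovider.Interval    # 22h by default, matching the interval in the warning
ipfs config Reprovider.Strategy    # "all", "pinned" or "roots"
ipfs stats provide                 # (re)provide statistics; requires the accelerated DHT client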
Metadata Access:
We also experience significant delays in accessing the metadata stored on IPFS.
Firewall Configuration
Open Ports: TCP 4001 for both nodes.
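For completeness, this is a standard GCP VPC firewall rule; a sketch of the equivalent gcloud command (the rule name and source range are placeholders, and UDP 4001 is only needed if QUIC is enabled later):
gcloud compute firewall-rules create allow-ipfs-swarm \
    --direction=INGRESS --action=ALLOW \
    --rules=tcp:4001,udp:4001 \
    --source-ranges=0.0.0.0/0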
Symptoms
Even with adequate CPU, memory, and storage resources, the slowness persists.
We have verified that the nodes are correctly advertising their data in the DHT.
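The verification was done along these lines (a sketch; <CID> is a placeholder for one of our recently added files):
ipfs id -f '<id>'              # on the node, to get its PeerID
ipfs routing findprovs <CID>   # from a separate machine; the PeerID above should appear in the output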
Resolution Attempts
Enabling Accelerated DHT:
On Node 2, the Accelerated DHT was enabled, but it did not resolve the issue.
Resource Check:
Increased RAM on Node 2 to better handle the load with Accelerated DHT.
Port Configuration:
Confirmed that only TCP port 4001 is open, as recommended.
Consultation of Forums and Resources:
Consulted various online sources suggesting difficulties in finding appropriate peers as a possible cause.
Observations
We found that “the provider calls keep searching for longer for appropriate peers. Undiallable nodes are a big issue right now”. This suggests that there are many unreachable nodes causing significant delays.
Request for Support
We are unable to understand what is causing this extreme slowness, especially considering that this problem did not occur with the previous software version. Can anyone help us identify and resolve the issue?
Hi Hector, you're correct; I'm not a developer myself, so I used ChatGPT to help structure and formulate the post for clarity. However, the logs and the technical details provided are all original and based on the actual data from our system.
This is the entire warning in the IPFS1 log:
2024-08-02 19:30:23 Your system is struggling to keep up with DHT reprovides!
2024-08-02 19:30:23 This means your content could partially or completely inaccessible on the network.
2024-08-02 19:30:23 We observed that you recently provided 128 keys at an average rate of 1m20.099219968s per key.
2024-08-02 19:30:23
2024-08-02 19:30:23 Your total CID count is ~29683 which would total at 660h26m25.146310144s reprovide process.
2024-08-02 19:30:23
2024-08-02 19:30:23 The total provide time needs to stay under your reprovide interval (22h0m0s) to prevent falling behind!
2024-08-02 19:30:23
2024-08-02 19:30:23 Consider enabling the Accelerated DHT to enhance your reprovide throughput. See:
After enabling the Accelerated DHT on Node 2, we noticed some improvement, but we’re still encountering significant issues. Although Node 2 is performing slightly better, it continues to struggle. The log shows:
2024-08-02 18:17:47 Your total CID count is ~29665 which would total at 352h44m41.45360112s reprovide process.
We are unable to understand what is causing this extreme slowness, especially since this issue did not occur with the previous version of the software.
Perhaps your machine does not have enough bandwidth and/or is being used for other things… you should be able to use regular system-administration monitoring tools and metrics to figure out what the bottleneck is, at least at a high level.
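For example (a rough sketch, not an exhaustive list; adjust the interface name to your VM):
ipfs stats bw    # libp2p-level bandwidth totals and current rates
iftop -i ens4    # live interface throughput (interface name will differ per machine)
vmstat 5         # CPU, memory and IO pressure over time
df -h            # make sure the IPFS data disk is not full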
However, when I attempted the IPFS Check test again a few days later, it failed and returned the following error: backend error: TypeError: Failed to fetch.
Could you please advise on what might be causing the “Could not find the multihash in the DHT” issue? Is there something specific we should be checking or adjusting in our configuration?
This means that you are not providing the hash in a timely manner.
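You can also measure this directly for a single key (a sketch; <CID> is a placeholder for one of your own CIDs):
time ipfs routing provide <CID>   # one-off provide; compare the wall time with the ~1m20s/key in your log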
You mentioned that the previous version did not suffer this problem. What about the latest (0.30.0-rc2 I think)?
Among the things that have changed between 0.20.0 and 0.30.0 are probably additional transports (e.g. QUIC) and perhaps changes to the resource manager. You can check ipfs swarm resources to see if any resources are at >90% usage.
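Concretely, something like this (a sketch; the second command just dumps the resource-manager configuration for comparison):
ipfs swarm resources           # limits vs. current usage per scope; look for anything near 90-100%
ipfs config Swarm.ResourceMgr  # resource manager configuration currently applied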
It is also worth checking whether your machine's bandwidth usage is now maxed out, or whether something else is slowing down providing: it seems to be too slow, and that might come from a constraint on connections to other peers or a constraint on bandwidth.
QUIC should both help with “cheaper” connectivity and much fewer roundtrips than TCP. This can have quite the impact when doing DHT provides to many peers.
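A hedged sketch of what enabling QUIC listeners looks like in the config (the addresses shown are the common defaults; adjust to your setup, open UDP 4001 in the firewall, and restart the daemon afterwards):
ipfs config Addresses.Swarm    # check current listen addresses
ipfs config --json Addresses.Swarm '["/ip4/0.0.0.0/tcp/4001","/ip6/::/tcp/4001","/ip4/0.0.0.0/udp/4001/quic-v1","/ip6/::/udp/4001/quic-v1"]'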
Final note about kubo/0.30.0-rc2
I also see that you’re running kubo/0.30.0-rc2/. We’ve discovered a bug in this version. So I’d recommend downgrading to kubo/0.30.0-rc1/ or to kubo/0.29.0/.
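If you go that route, a sketch of installing a specific Kubo release from the official distribution site (version string and platform are placeholders to adjust):
wget https://dist.ipfs.tech/kubo/v0.29.0/kubo_v0.29.0_linux-amd64.tar.gz
tar -xzf kubo_v0.29.0_linux-amd64.tar.gz
cd kubo && sudo bash install.sh
ipfs version    # confirm the binary before restarting the daemon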
Hi Daniel and Hector, thank you again for your suggestions.
We’ve updated the Swarm addresses to use /quic-v1, and confirmed that UDP port 4001 was already open. After enabling Accelerated DHT on Node 2, it stabilized at around 3000 peers without the RAM spikes we experienced earlier. Node 1 remains stable with around 2600 peers, and we’re waiting to test further before enabling Accelerated DHT on Node 1.
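For reference, the kind of checks we are relying on here (a sketch):
ipfs swarm addrs listen    # should include .../udp/4001/quic-v1 entries
ipfs swarm peers | wc -l   # current connected peer count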
As for the Kubo version, we upgraded to v0.30.0-rc2 as recommended by Hector. For now, we’re avoiding too many changes at once and are monitoring the system’s performance with QUIC v1 enabled and Accelerated DHT running on Node 2. @hector I’d appreciate your thoughts on the bug Daniel mentioned with v0.30.0-rc2 before we consider any downgrades.
When I initially tested with IPFS Check, I was receiving the error “multihash not found in DHT.” However, now when I run the test, I consistently encounter the following error:
backend error: TypeError: Failed to fetch
Because of this, I’m unable to verify whether the multihash DHT issue has been resolved or not. Despite this, I’ve noticed improvements in accessing metadata overall.
Do you have any suggestions on how I can verify that everything is functioning correctly, given the recurring backend error?
We just released v0.30.0-rc3 last night. So feel free to bump it to that version rather than downgrading, to make sure you don’t hit any of those problems specific to rc2.
Btw, if you run the check without a multiaddr, you will see that it can’t find any providers, proving that it’s having trouble announcing the provider record to the DHT.
However, both tests continue to return the following error:
backend error: TypeError: Failed to fetch
I'm wondering if this is a localized issue or if there is a broader problem with the backend. The IPFS Check tool seems to rely on https://ipfs-check-backend.ipfs.io, and since IPFS.io is currently down, could this be part of the problem?
We've also installed the rc3 version on Node 2, but I'm not too convinced about its performance so far. Initially, the node started up well with 2600 peers, but it has now dropped to below 200 peers.
I’ve identified that the previous issue with the backend error: TypeError: Failed to fetch was due to my internet service provider, so that problem is now resolved. However, after running some tests on both nodes, I’m encountering the following issues:
Node 1 (v0.30, no Accelerated DHT)
Could not connect to multiaddr: failed to dial
Found multiaddr with 10 DHT peers
Could not find the multihash in the DHT
The peer did not quickly respond if it had the CID
Node 2 (v0.30-rc3, Accelerated DHT enabled)
Could not connect to multiaddr: failed to dial
Could not find the given multiaddr in the DHT (found a different one instead)
Found multihash advertised in the DHT
The peer did not quickly respond if it had the CID
It seems like the connection to the multiaddr on both nodes is failing with a dial backoff error, and for Node 1, I’m unable to retrieve the multihash from the DHT. Node 2, which is running with Accelerated DHT, can find the multihash, but is still facing connection issues.
How can we resolve the dial backoff error and the multihash retrieval issue on both nodes?
Based on the results, it appears that Peer 1 is undialable and does not advertise the multihash in the DHT. I would double-check your networking configuration and firewall to fix the dialability/connection problem.
As for Peer 2, it seems to be functioning fine. It advertises correctly and I can connect with both QUIC and TCP (I tried both manually).
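A few basic checks for the dialability problem on Peer 1 (a sketch; <NODE_PUBLIC_IP> and <PEER_ID> are placeholders for your node's values):
ipfs id                                                           # on the node: the addresses it announces
nc -vz <NODE_PUBLIC_IP> 4001                                      # from outside: raw TCP reachability of the swarm port
ipfs swarm connect /ip4/<NODE_PUBLIC_IP>/tcp/4001/p2p/<PEER_ID>   # from another IPFS node: direct dial attempt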