Incident Report - Increased Latency on the Amino DHT

TL;DR

  • Since 4 December, the ProbeLab team, which monitors the IPFS network, has observed the following major anomalies in Amino (the public IPFS DHT):
    • A significant increase in the number of DHT server nodes on the Amino network which appear as offline according to our crawler. Because those nodes are mostly offline, they have a direct impact on the latency observed when publishing and fetching content from the network (see the plots below) (source).
    • Increased latency when publishing to Amino (source).
    • Increased latency when looking up records in Amino (source).
  • For the raw data from the latest crawl, see the latest report from the Nebula crawler.
  • This may degrade content and peer routing performance on the network.
  • We are investigating the cause. Upon initial investigation, it appears to be the result of the Avail network merging with the Amino IPFS DHT.
  • We’re working together with the Avail team to mitigate this.

Background

There is more than one DHT

The Amino DHT used by IPFS and implemented in libp2p (in its various language implementations) is used for content (mapping CIDs → PeerIDs) and peer routing (mapping PeerIDs → IP addresses).
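
To make those two roles concrete, here is a minimal sketch (assuming a recent go-libp2p and go-libp2p-kad-dht; the client mode, the example CID and the crude 10-second wait are placeholder choices, not how ProbeLab measures anything) of a node that joins Amino via the default bootstrap peers and performs a content-routing and a peer-routing lookup:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/ipfs/go-cid"
	"github.com/libp2p/go-libp2p"
	dht "github.com/libp2p/go-libp2p-kad-dht"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()

	// A plain libp2p host with default options.
	h, err := libp2p.New()
	if err != nil {
		panic(err)
	}
	defer h.Close()

	// Join Amino as a DHT client, seeded with the default bootstrap peers.
	d, err := dht.New(ctx, h,
		dht.Mode(dht.ModeClient),
		dht.BootstrapPeers(dht.GetDefaultBootstrapPeerAddrInfos()...),
	)
	if err != nil {
		panic(err)
	}
	if err := d.Bootstrap(ctx); err != nil {
		panic(err)
	}
	time.Sleep(10 * time.Second) // crude wait for the routing table to populate

	// Content routing: CID -> peers that provide it.
	// (Well-known go-ipfs "getting started" directory CID, used purely as an example.)
	c, err := cid.Decode("QmYwAPJzv5CZsnA625s3Xf2nemtYgPpHdWEz79ojWnPbdG")
	if err != nil {
		panic(err)
	}
	for provider := range d.FindProvidersAsync(ctx, c, 5) {
		fmt.Println("provider:", provider.ID)

		// Peer routing: PeerID -> addresses.
		if ai, err := d.FindPeer(ctx, provider.ID); err == nil {
			fmt.Println("  addresses:", ai.Addrs)
		}
	}
}
```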

Peers in the DHT use protocol IDs to declare which network they are a part of. For example, IPFS nodes advertise, among other protocols such as /ipfs/bitswap/1.2.0, the protocol ID /ipfs/kad/1.0.0.

The versatility of this DHT implementation has led many other decentralised networks to adopt it. Typically, when a separate network adopts it, it should use a different protocol ID, e.g. /specialnetwork/kad/1.0.0.
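
As an illustration of that convention, the sketch below (assuming a recent go-libp2p-kad-dht; /specialnetwork is a made-up name) configures a DHT server with its own protocol prefix, so that it advertises /specialnetwork/kad/1.0.0 rather than the Amino default /ipfs/kad/1.0.0 and Amino peers have no reason to add it to their routing tables:

```go
package main

import (
	"context"
	"fmt"

	"github.com/libp2p/go-libp2p"
	dht "github.com/libp2p/go-libp2p-kad-dht"
	"github.com/libp2p/go-libp2p/core/protocol"
)

func main() {
	ctx := context.Background()

	h, err := libp2p.New()
	if err != nil {
		panic(err)
	}
	defer h.Close()

	// A DHT server for a hypothetical "/specialnetwork" overlay. With this
	// prefix the node speaks /specialnetwork/kad/1.0.0 instead of the Amino
	// default /ipfs/kad/1.0.0.
	d, err := dht.New(ctx, h,
		dht.Mode(dht.ModeServer),
		dht.ProtocolPrefix(protocol.ID("/specialnetwork")),
	)
	if err != nil {
		panic(err)
	}
	defer d.Close()

	// The host's protocol list now includes the network-specific DHT protocol.
	fmt.Println(h.Mux().Protocols())
}
```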

Merging of DHTs

Based on IPFS network crawls, it appears that Avail nodes are advertising the same IPFS Kademlia protocol ID (/ipfs/kad/1.0.0). Although identical protocol IDs alone aren’t enough to cause two networks to merge (nodes still need to discover each other), the two networks do in fact appear to have merged.

On the surface, there is nothing wrong with that, assuming that peers from both networks adhere to the protocol’s spec and respond to its RPC messages. However, since most of the new DHT servers we observe appear offline and latency has increased, we suspect something is malfunctioning.
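
For illustration only, here is a rough sketch of the kind of check involved (assuming a recent go-libp2p; the checkPeer helper and the timeout are arbitrary choices, and this is not ProbeLab’s crawler code): it looks at whether a connected peer advertises /ipfs/kad/1.0.0 via identify, and whether it answers a libp2p ping within a deadline.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/libp2p/go-libp2p"
	"github.com/libp2p/go-libp2p/core/host"
	"github.com/libp2p/go-libp2p/core/peer"
	"github.com/libp2p/go-libp2p/core/protocol"
	"github.com/libp2p/go-libp2p/p2p/protocol/ping"
)

const aminoKadID = protocol.ID("/ipfs/kad/1.0.0")

// checkPeer (hypothetical helper) reports whether a peer advertises the Amino
// Kademlia protocol and whether it answers a ping within the given timeout.
// Peers that advertise /ipfs/kad/1.0.0 but never answer behave like the
// "offline" DHT servers the crawler is seeing.
func checkPeer(ctx context.Context, h host.Host, p peer.ID, timeout time.Duration) {
	protos, err := h.Peerstore().GetProtocols(p)
	if err != nil {
		fmt.Println(p, "no identify info yet:", err)
		return
	}

	speaksKad := false
	for _, pr := range protos {
		if pr == aminoKadID {
			speaksKad = true
			break
		}
	}
	if !speaksKad {
		fmt.Println(p, "does not advertise", aminoKadID)
		return
	}

	pctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	res, ok := <-ping.Ping(pctx, h, p)
	if !ok || res.Error != nil {
		fmt.Println(p, "advertises", aminoKadID, "but did not respond in time")
		return
	}
	fmt.Println(p, "advertises", aminoKadID, "and responded in", res.RTT)
}

func main() {
	h, err := libp2p.New()
	if err != nil {
		panic(err)
	}
	defer h.Close()

	// In practice you would first connect to some peers (e.g. the public
	// bootstrap peers) and then check each entry returned here.
	for _, p := range h.Network().Peers() {
		checkPeer(context.Background(), h, p, 10*time.Second)
	}
}
```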

What’s next?

We’re investigating the root cause of this and working on a solution along with the Avail Project team, who have been very responsive and are also investigating solutions from their end.

For now, expect higher latencies for DHT operations on the network. A good resource for up-to-date information is https://probelab.io, in particular the DHT Lookup performance and DHT Publish latency plots.

We will update this post as soon as we have more information.

Update 18-12-2023

The IPFS network performance seems to be getting back to normal levels from what I can see at probelab.io. In particular:

  • No large numbers of new nodes seem to be joining the network: IPFS DHT | ProbeLab
  • DHT lookup performance seems to be going back down to pre-incident levels: IPFS DHT | ProbeLab - not 100% there yet, but there’s a clear downward trend
  • There was a very sharp decrease in DHT publish latency, down to pre-incident levels, which most likely indicates that there aren’t many peers left that don’t respond to publish requests: IPFS DHT | ProbeLab
  • There’s still something weird with the overall number of nodes in the network: IPFS KPIs | ProbeLab. ProbeLab’s infrastructure sees only a few clients and mostly servers, which is not normal. Since this isn’t urgent or alarming and could be a misclassification issue, we will look into it retrospectively.

Avail’s incident report

The Avail team has also published an incident report with more information and the actions they took to resolve this: