Incident Report - Increased Latency on the Amino DHT

TL;DR

  • Since 4 December, the ProbeLab team, which monitors the IPFS network, has observed the following major anomalies in Amino (the public IPFS DHT):
    • A significant increase in the number of DHT server nodes on the Amino network which appear as offline according to our crawler. Because those nodes are mostly offline, they have a direct impact on the latency observed when publishing and fetching content from the network (see the plots below) (source).
    • Increased latency when publishing to Amino (source).
    • Increased latency when looking up records in Amino (source).
  • For the raw data from the latest crawl, see the latest report from the Nebula crawler.
  • This may degrade content and peer routing performance on the network.
  • We are investigating the cause. Upon initial investigation, it appears to be the result of the Avail network merging with the Amino IPFS DHT.
  • We’re working together with the Avail team to mitigate this.

Background

There is more than one DHT

The Amino DHT used by IPFS and implemented in libp2p (in its various language implementations) is used for content (mapping CIDs → PeerIDs) and peer routing (mapping PeerIDs → IP addresses).
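
To make those two roles concrete, here is a minimal sketch (assuming a recent go-libp2p and go-libp2p-kad-dht; the client mode, the example CID and the crude 10-second wait are placeholder choices, not how ProbeLab measures anything) of a node that joins Amino via the default bootstrap peers and performs a content-routing and a peer-routing lookup:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/ipfs/go-cid"
	"github.com/libp2p/go-libp2p"
	dht "github.com/libp2p/go-libp2p-kad-dht"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()

	// A plain libp2p host with default options.
	h, err := libp2p.New()
	if err != nil {
		panic(err)
	}
	defer h.Close()

	// Join Amino as a DHT client, seeded with the default bootstrap peers.
	d, err := dht.New(ctx, h,
		dht.Mode(dht.ModeClient),
		dht.BootstrapPeers(dht.GetDefaultBootstrapPeerAddrInfos()...),
	)
	if err != nil {
		panic(err)
	}
	if err := d.Bootstrap(ctx); err != nil {
		panic(err)
	}
	time.Sleep(10 * time.Second) // crude wait for the routing table to populate

	// Content routing: CID -> peers that provide it.
	// (Well-known go-ipfs "getting started" directory CID, used purely as an example.)
	c, err := cid.Decode("QmYwAPJzv5CZsnA625s3Xf2nemtYgPpHdWEz79ojWnPbdG")
	if err != nil {
		panic(err)
	}
	for provider := range d.FindProvidersAsync(ctx, c, 5) {
		fmt.Println("provider:", provider.ID)

		// Peer routing: PeerID -> addresses.
		if ai, err := d.FindPeer(ctx, provider.ID); err == nil {
			fmt.Println("  addresses:", ai.Addrs)
		}
	}
}
```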

Peers in the DHT use protocol IDs to declare which network they are a part of. For example, IPFS nodes advertise, among other protocols such as /ipfs/bitswap/1.2.0, the protocol ID /ipfs/kad/1.0.0.

The versatility of this DHT implementation has led many other decentralised networks to adopt it. Typically, when a separate network adopts it, it should use a different protocol ID, e.g. /specialnetwork/kad/1.0.0.
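
As an illustration of that convention, the sketch below (assuming a recent go-libp2p-kad-dht; /specialnetwork is a made-up name) configures a DHT server with its own protocol prefix, so that it advertises /specialnetwork/kad/1.0.0 rather than the Amino default /ipfs/kad/1.0.0 and Amino peers have no reason to add it to their routing tables:

```go
package main

import (
	"context"
	"fmt"

	"github.com/libp2p/go-libp2p"
	dht "github.com/libp2p/go-libp2p-kad-dht"
	"github.com/libp2p/go-libp2p/core/protocol"
)

func main() {
	ctx := context.Background()

	h, err := libp2p.New()
	if err != nil {
		panic(err)
	}
	defer h.Close()

	// A DHT server for a hypothetical "/specialnetwork" overlay. With this
	// prefix the node speaks /specialnetwork/kad/1.0.0 instead of the Amino
	// default /ipfs/kad/1.0.0.
	d, err := dht.New(ctx, h,
		dht.Mode(dht.ModeServer),
		dht.ProtocolPrefix(protocol.ID("/specialnetwork")),
	)
	if err != nil {
		panic(err)
	}
	defer d.Close()

	// The host's protocol list now includes the network-specific DHT protocol.
	fmt.Println(h.Mux().Protocols())
}
```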

Merging of DHTs

Based on IPFS network crawls, it appears that Avail nodes are advertising the same IPFS Kademlia protocol ID (/ipfs/kad/1.0.0). Although identical protocol IDs alone aren’t enough to cause two networks to merge (nodes still need to discover each other), the two networks do in fact appear to have merged.

On the surface, there is nothing wrong with that, assuming that peers from both networks adhere to the protocol’s spec and respond to its RPC messages. However, since most of the new DHT servers we observe appear offline and latency has increased, we suspect something is malfunctioning.
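
For illustration only, here is a rough sketch of the kind of check involved (assuming a recent go-libp2p; the checkPeer helper and the timeout are arbitrary choices, and this is not ProbeLab’s crawler code): it looks at whether a connected peer advertises /ipfs/kad/1.0.0 via identify, and whether it answers a libp2p ping within a deadline.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/libp2p/go-libp2p"
	"github.com/libp2p/go-libp2p/core/host"
	"github.com/libp2p/go-libp2p/core/peer"
	"github.com/libp2p/go-libp2p/core/protocol"
	"github.com/libp2p/go-libp2p/p2p/protocol/ping"
)

const aminoKadID = protocol.ID("/ipfs/kad/1.0.0")

// checkPeer (hypothetical helper) reports whether a peer advertises the Amino
// Kademlia protocol and whether it answers a ping within the given timeout.
// Peers that advertise /ipfs/kad/1.0.0 but never answer behave like the
// "offline" DHT servers the crawler is seeing.
func checkPeer(ctx context.Context, h host.Host, p peer.ID, timeout time.Duration) {
	protos, err := h.Peerstore().GetProtocols(p)
	if err != nil {
		fmt.Println(p, "no identify info yet:", err)
		return
	}

	speaksKad := false
	for _, pr := range protos {
		if pr == aminoKadID {
			speaksKad = true
			break
		}
	}
	if !speaksKad {
		fmt.Println(p, "does not advertise", aminoKadID)
		return
	}

	pctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	res, ok := <-ping.Ping(pctx, h, p)
	if !ok || res.Error != nil {
		fmt.Println(p, "advertises", aminoKadID, "but did not respond in time")
		return
	}
	fmt.Println(p, "advertises", aminoKadID, "and responded in", res.RTT)
}

func main() {
	h, err := libp2p.New()
	if err != nil {
		panic(err)
	}
	defer h.Close()

	// In practice you would first connect to some peers (e.g. the public
	// bootstrap peers) and then check each entry returned here.
	for _, p := range h.Network().Peers() {
		checkPeer(context.Background(), h, p, 10*time.Second)
	}
}
```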

What’s next?

We’re investigating the root cause of this and working on a solution along with the Avail Project team, who have been very responsive and are also investigating solutions from their end.

For now, expect higher latencies for DHT operations on the network. A good resource for up-to-date information is https://probelab.io, in particular the DHT Lookup performance and DHT Publish latency plots.

We will update this post as soon as we have more information.

Update 18-12-2023

The IPFS network performance seems to be getting back to normal levels from what I can see at probelab.io. In particular:

  • No large numbers of new nodes seem to be joining the network: IPFS DHT | ProbeLab
  • DHT lookup performance seems to be going back down to pre-incident levels: IPFS DHT | ProbeLab - not 100% there yet, but there’s a clear downward trend
  • There was a very sharp decrease in DHT publish latency, down to pre-incident levels, which most likely indicates that there aren’t many peers left that don’t respond to publish requests: IPFS DHT | ProbeLab
  • There’s still something weird with the overall number of nodes in the network: IPFS KPIs | ProbeLab. ProbeLab’s infrastructure sees only a few clients and mostly servers, which is not normal. Since this isn’t urgent or alarming and could be a misclassification issue, we will look into it retrospectively.

Avail’s incident report

The Avail team has also published an incident report with more information and the actions they took to resolve this: