Connection closes during bitswap fetches

We have recently made first attempts at integrating Helia with our p2p database system, OrbitDB.

In most cases, Helia runs without issue, but we are experiencing connection dropouts during block syncing and have isolated the problem to js-ipfs-bitswap. In particular, it appears that libp2p closes the connection between two browser peers, causing our database syncing protocol to break. The issue is not reproducible when syncing between Node.js peers.

To replicate the problem:

Clone OrbitDB, checkout Helia branch and install dependencies:

git clone https://github.com/orbitdb/orbitdb.git
cd ./orbitdb
git checkout helia
npm i

Launch the relay:

npm run webrtc

Run the web browser tests:

npm run test:browser

The test should run successfully.

Next, open the file ./test/orbitdb-replication.test.js and change line 35 to read:

const amount = 85 + 1

Save the changes.

Run the browser test again:

npm run test:browser

It will time out. 86 records seems to be the magic number at which the sync will no longer complete successfully. If it still runs successfully, increase the value to const amount = 128 + 1. At some point, the number of records to be synced will cause the syncing to hang and time out.

We have traced the problem to Bitswap.

When 85 + 1 or more records are specified, the sync seems to hang and the Promise.race eventually times out. In particular, I think it is loadOrFetchFromNetwork that is not resolving; I’m guessing the first promise will not resolve while the block is not stored locally. Notably, onBlock does not seem to get fired.

When logging is enabled, we are seeing the following errors in msg_queue when sendMessage gets called:

send error CodeError: stream reset
    at MplexStream.reset (/home/haydenyoung/Development/orbitdb/orbit-db/test/browser/webpack:/@orbitdb/core/node_modules/@libp2p/interface-stream-muxer/dist/src/stream.js:144:1)
    at MplexStreamMuxer._handleIncoming (/home/haydenyoung/Development/orbitdb/orbit-db/test/browser/webpack:/@orbitdb/core/node_modules/@libp2p/mplex/dist/src/mplex.js:260:1)
    at MplexStreamMuxer.sink (/home/haydenyoung/Development/orbitdb/orbit-db/test/browser/webpack:/@orbitdb/core/node_modules/@libp2p/mplex/dist/src/mplex.js:160:1)
    at async Promise.all (index 0)
send error Error: Muxer already closed
    at MplexStreamMuxer.newStream (/home/haydenyoung/Development/orbitdb/orbit-db/test/browser/webpack:/@orbitdb/core/node_modules/@libp2p/mplex/dist/src/mplex.js:93:1)
    at ConnectionImpl.newStream [as _newStream] (/home/haydenyoung/Development/orbitdb/orbit-db/test/browser/webpack:/@orbitdb/core/node_modules/libp2p/dist/src/upgrader.js:312:1)
    at ConnectionImpl.newStream (/home/haydenyoung/Development/orbitdb/orbit-db/test/browser/webpack:/@orbitdb/core/node_modules/libp2p/dist/src/connection/index.js:85:1)
    at Libp2pNode.dialProtocol (/home/haydenyoung/Development/orbitdb/orbit-db/test/browser/webpack:/@orbitdb/core/node_modules/libp2p/dist/src/libp2p.js:230:1)
    at Network._writeMessage (/home/haydenyoung/Development/orbitdb/orbit-db/test/browser/webpack:/@orbitdb/core/node_modules/ipfs-bitswap/dist/src/network.js:190:1)
    at Network.sendMessage (/home/haydenyoung/Development/orbitdb/orbit-db/test/browser/webpack:/@orbitdb/core/node_modules/ipfs-bitswap/dist/src/network.js:167:1)

libp2p is also throwing similar errors regarding mplex and gossipsub, but I think the gossipsub issue is simply a side effect of the underlying connection closing prematurely.

We’re not sure why the connection suddenly closes and we haven’t had much success debugging it. I’m tempted to assume that the underlying libp2p is the culprit, but I’m not sure why it would drop the connection with such consistency.

I haven’t opened a GitHub issue yet as I’m first ruling out a configuration issue with our setup.


I’ve replicated this locally, thanks so much for the detailed repro steps.

From what I can see, you’ve got browser nodes connected to each other through a relay to perform the sync.

As of libp2p@0.43.x the relay implementation is Circuit Relay v2 - this is a little different to v1 in that all connections are limited in time/data.

The idea here is that relays should be cheap to run and turned on by default - this makes every publicly diallable node on the network usable to coordinate NAT hole punching, to help browsers dial each other, and so on. But you don’t want to allow peers to transfer unlimited amounts of data forever, as that would be ripe for abuse.

In the test, the browser nodes are transferring all data via the relay, which trips the data limit, so the connection is closed.

You have a couple of options:

1. Turn off relay limits

This restores the behaviour of circuit relay v1 and is not advised for production deployment, but it’s ok for testing.

circuitRelayServer({
  reservations: {
    applyDefaultLimit: false
  }
})

Unfortunately there’s actually a bug around this config setting which will be fixed in the next release, so for now you should increase the data limit instead:

circuitRelayServer({
  reservations: {
    defaultDataLimit: BigInt(1024 * 1024 * 1024) // or just something big
  }
})

2. Use a direct connection between browsers

The reason it works in Node.js and not in browsers is that Node.js nodes can dial each other directly, so there are no limits.

To accomplish something similar in browsers you should get them to connect via WebRTC - there’s an example of how to get this working in browsers here.
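
For reference, here’s a minimal sketch of a browser-side node that can use both the relay and WebRTC. This assumes the libp2p@0.46.x module layout and is a starting point rather than a definitive config:

import { createLibp2p } from 'libp2p'
import { circuitRelayTransport } from 'libp2p/circuit-relay'
import { webSockets } from '@libp2p/websockets'
import { all } from '@libp2p/websockets/filters'
import { webRTC } from '@libp2p/webrtc'
import { noise } from '@chainsafe/libp2p-noise'
import { yamux } from '@chainsafe/libp2p-yamux'

const node = await createLibp2p({
  addresses: {
    // listen for incoming WebRTC connections, signalled over the relay
    listen: ['/webrtc']
  },
  transports: [
    // websockets to reach the relay; the `all` filter permits the
    // plain (non-TLS) loopback address used in the tests
    webSockets({ filter: all }),
    webRTC(),
    // allows relayed connections and relay reservations
    circuitRelayTransport({ discoverRelays: 1 })
  ],
  connectionEncryption: [noise()],
  streamMuxers: [yamux()]
})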


Thanks for the recommendations. I have implemented:

defaultDataLimit: BigInt(1024 * 1024 * 1024)

which has successfully resolved the issue.

Regarding option 2 (use a direct connection between browsers), we run the following to a) discover browser addresses and b) connect the two browser peers:

    const relayId = '12D3KooWAJjbRkp8FPF5MKgMU53aUTxWkqvDrs4zc1VMbwRwfsbE'

    await ipfs1.libp2p.dial(multiaddr(`/ip4/127.0.0.1/tcp/12345/ws/p2p/${relayId}`))
    await ipfs2.libp2p.dial(multiaddr(`/ip4/127.0.0.1/tcp/12345/ws/p2p/${relayId}`))

    const a1 = multiaddr(`/ip4/127.0.0.1/tcp/12345/ws/p2p/${relayId}/p2p-circuit/p2p/${ipfs1.libp2p.peerId.toString()}`)
    const a2 = multiaddr(`/ip4/127.0.0.1/tcp/12345/ws/p2p/${relayId}/p2p-circuit/p2p/${ipfs2.libp2p.peerId.toString()}`)

^ The two peers have connected to the relay and have discovered their “public” addresses. They are now able to connect directly.

    await ipfs2.libp2p.dial(a1)
    await ipfs1.libp2p.dial(a2)

^ The two peers dial each other, creating a direct connection.

However, ipfs1 and ipfs2 are still using the relay to communicate (I’m assuming it is a simple case of using the first available connection, so the relay is used instead of ipfs1 -> ipfs2) and the relay limits kick in. Should I therefore manually disconnect each peer from the relay so that ipfs1 and ipfs2 can “communicate directly”? I’ve tried using libp2p’s hangUp function but without success.
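
For context, this is roughly what was attempted - a sketch only, assuming hangUp is passed the relay’s peer ID (peerIdFromString is from @libp2p/peer-id):

import { peerIdFromString } from '@libp2p/peer-id'

// attempted: drop each peer’s connection to the relay so that only the
// direct ipfs1 <-> ipfs2 connection remains
await ipfs1.libp2p.hangUp(peerIdFromString(relayId))
await ipfs2.libp2p.hangUp(peerIdFromString(relayId))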


FYI libp2p@0.46.14 has the fix for applyDefaultLimit mentioned above so you can use that instead of defaultDataLimit now if you wish.
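
That is, the option 1 config from earlier in the thread should now behave as documented:

circuitRelayServer({
  reservations: {
    applyDefaultLimit: false
  }
})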


In the code snippet above a1 and a2 are still circuit relay addresses. You would have to do something else like:

import { WebRTC } from '@multiformats/multiaddr-matcher'
import { multiaddr } from '@multiformats/multiaddr'
import delay from 'delay' // small promise-based timeout helper

const relayId = '12D3KooWAJjbRkp8FPF5MKgMU53aUTxWkqvDrs4zc1VMbwRwfsbE'

// ipfs2 dials the relay
await ipfs2.libp2p.dial(multiaddr(`/ip4/127.0.0.1/tcp/12345/ws/p2p/${relayId}`))

// wait for ipfs2 to request a relay reservation and start listening on a WebRTC address
let webRTCAddress

while (true) {
  webRTCAddress = ipfs2.libp2p.getMultiaddrs()
    .filter(ma => WebRTC.exactMatch(ma))
    .pop()

  if (webRTCAddress != null) {
    break
  }

  // try again in a bit
  await delay(100)
}

// make direct connection - no need to dial in both directions
await ipfs1.libp2p.dial(webRTCAddress)

Again, thanks for the help.

FYI libp2p@0.46.14 has the fix for applyDefaultLimit mentioned above so you can use that instead of defaultDataLimit now if you wish.

Okay, good news. Will evaluate.

In the code snippet above a1 and a2 are still circuit relay addresses. You would have to do something else like:

Thanks, that makes sense. I have implemented the recommendation to filter out the WebRTC address and use the filtered address to connect the peers directly. The peer connection is set up during each test suite (per “test.js” file, in the before() setup) and all test suites run up until the third p2p connection (i.e. the third time connectPeers() is called), whereupon the two peers seem to connect but are unable to communicate, eventually resulting in a timeout.

If I skip the offending tests, the timeout occurs at the next connectPeers() call (i.e., again, the third p2p connection).

If I send everything through the relay, it works (mostly; I sometimes get a timeout on the last test suite, but it is inconsistent), so I’m wondering if there is an issue with OrbitDB’s Helia/libp2p config.

I’ve also used DEBUG='libp2p:*' to see if there is any difference between communicating directly and over the relay, and the only difference appears right before the peers communicate with one another using the /orbitdb/heads protocol. For direct p2p, I see an mplex error and the connection hangs:

libp2p:mplex error in sink +1ms Error: The operation was aborted

I have used your recommendation to create a p2p connection and peer2 appears to dial peer1 successfully.

Further to my comment above, I have managed to test a direct browser-to-browser connection in isolation and am finding that libp2p.dial hangs when I attempt to dial ipfs1 from ipfs2.

Logging the relay server yields:

libp2p:circuit-relay:server:error hop connect denied for destination peer 12D3KooWL5vXWtJ1dCRGiJGpck4V7aReT4nZ2A9cyBpQ2bWCkFdZ not having a reservation for 12D3KooWCXBqMowVg6pyXNCi3Qt75XQgjxd9edy6gCrWScjLFhuf with status NO_RESERVATION +0ms

and the same mplex “operation aborted” error as above, but I’m not sure if the relay’s connect-denied error is related to libp2p.dial eventually timing out.

IMHO, you should receive many WebRTC multiaddrs for each peer, which could be of interest!

E.g. 127.0.0.1, LAN addresses, public addresses and maybe even VPN addresses.

The code example from @achingbrain might not be 100% correct. (I didn’t try it - I’m just guessing.)

while (true) {
  webRTCAddress = ipfs2.libp2p.getMultiaddrs()
    .filter(ma => WebRTC.exactMatch(ma))
    .pop()

  // DON’T just take the first WebRTC address - collect them all
  if (webRTCAddress != null) {
    break
  }

  // try again in a bit
  await delay(100)
}

The first WebRTC address might not be the best! I’d collect and dial them all.
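
A sketch of that idea, reusing the names from the example above (the WebRTC matcher, the delay helper, the ipfs1/ipfs2 peers) - untested, but it shows the shape:

// wait until ipfs2 is listening on at least one WebRTC address
let webRTCAddresses = []

while (true) {
  webRTCAddresses = ipfs2.libp2p.getMultiaddrs()
    .filter(ma => WebRTC.exactMatch(ma))

  if (webRTCAddresses.length > 0) {
    break
  }

  // try again in a bit
  await delay(100)
}

// try each address in turn until one dial succeeds
for (const address of webRTCAddresses) {
  try {
    await ipfs1.libp2p.dial(address)
    break
  } catch (err) {
    console.error(`could not dial ${address.toString()}`, err)
  }
}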

Thanks for the suggestion. Unfortunately, dialling each of the WebRTC addresses results in the same issue: a timeout with a NO_RESERVATION error. I suspect the issue occurs when peer2 attempts to initiate a connection with peer1 via the relay, whereby peer1 is not willing to facilitate the connection because peer2 is not in peer1’s reservation list. However, I’m not sure why this is the case.

This problem does look very similar to the previous issue with the relay, and it is resolved by increasing the number of reservations the relay is willing to handle. However, I’m not sure if the same configuration applies to a browser-based peer.

My limited understanding of WebRTC negotiation assumes:

  1. peerA requests a listening address from the relay,
  2. peerB connects to peerA using that listening address; because a browser node cannot listen for connections, the relay handles this requirement, negotiating the connection between peerA and peerB,
  3. once the relay has facilitated the connection of peerB to peerA, the two peers can communicate directly without the need for the relay.

The problem I’m seeing is at step 2, which I believe to be caused by the relay refusing to negotiate the connection on peerA and peerB’s behalf because peerA is unaware of peerB.

Adding to this, I ran the browser-to-browser example from the libp2p examples and am seeing the same NO_RESERVATION issue. I have also applied the defaultDataLimit limits to the relay, although my understanding is that I shouldn’t have to, because they don’t apply once the peers are directly connected.

So, it seems there may be another issue here with direct WebRTC connections between browser peers.

Is anyone able to provide some advice on how to resolve this issue? I have seen a similar issue raised against the Rust implementation, but I’m not sure how to implement a fix in js-libp2p.

Adding to this, I ran the browser-to-browser example from the libp2p examples and am seeing the same NO_RESERVATION issue. I have also applied the defaultDataLimit limits to the relay, although my understanding is that I shouldn’t have to, because they don’t apply once the peers are directly connected.

It turned out the problem here was a configuration issue in the relay: in particular, reservationClearInterval was clearing reservations before peers had completed connecting to one another.
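
For anyone hitting the same thing, a sketch of the relevant relay options, assuming the libp2p@0.46.x option names - the values here are illustrative, not recommendations:

circuitRelayServer({
  reservations: {
    // how long a reservation remains valid (ms)
    reservationTtl: 2 * 60 * 60 * 1000,
    // how often expired reservations are swept (ms) - if a reservation is
    // cleared before the peers finish dialling each other, the relay answers
    // subsequent hop requests with NO_RESERVATION
    reservationClearInterval: 5 * 60 * 1000
  }
})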
