I was exploring the IPFS DHT implementation using the go-libp2p and go-libp2p-kad-dht libraries. As per the documentation (dht - GoDoc), GetClosestPeers is a Kademlia node-lookup operation that returns the K (=20) closest peers to a key (https://docs.ipfs.io/concepts/dht/#lookup-algorithm); to measure closeness, it takes the XOR of hash(peerID) and hash(key).
I fed keys from “0” to “1000” to the GetClosestPeers API in a network of 50 peers and observed that the K (=20) closest peers were not uniformly distributed across the network, but skewed [A]. I was expecting them to be nearly uniformly distributed on average. To verify my understanding, I manually took the XOR of the hashes of some dummy peer IDs and the hashes of keys ranging from “0” to “1000”, extracted the K (=20) closest peers for each key, and observed a similar skew [B].
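For reference, the manual computation in [B] can be sketched as below. This is only an illustration of my understanding: it assumes SHA-256 as the hash and uses made-up peer-ID strings; the actual hashing and peer-ID handling inside go-libp2p-kad-dht may differ.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"math/big"
	"sort"
)

// xorDistance returns XOR(sha256(a), sha256(b)) as a big.Int,
// mirroring Kademlia's notion of closeness.
func xorDistance(a, b string) *big.Int {
	ha := sha256.Sum256([]byte(a))
	hb := sha256.Sum256([]byte(b))
	x := make([]byte, len(ha))
	for i := range ha {
		x[i] = ha[i] ^ hb[i]
	}
	return new(big.Int).SetBytes(x)
}

// kClosest returns the k peer IDs with the smallest XOR distance to key.
func kClosest(key string, peers []string, k int) []string {
	sorted := append([]string(nil), peers...)
	sort.Slice(sorted, func(i, j int) bool {
		return xorDistance(key, sorted[i]).Cmp(xorDistance(key, sorted[j])) < 0
	})
	if k > len(sorted) {
		k = len(sorted)
	}
	return sorted[:k]
}

func main() {
	// Dummy peer IDs invented for illustration (not real libp2p IDs).
	peers := make([]string, 50)
	for i := range peers {
		peers[i] = fmt.Sprintf("peer-%d", i)
	}
	// Count how often each peer appears among the 20 closest for 1000 keys.
	counts := map[string]int{}
	for key := 0; key < 1000; key++ {
		for _, p := range kClosest(fmt.Sprint(key), peers, 20) {
			counts[p]++
		}
	}
	for _, p := range peers[:2] {
		fmt.Printf("%s: %.1f%%\n", p, 100*float64(counts[p])/1000)
	}
}
```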

Below are the results of my experiments.
[A]: Distribution of peers after running GetClosestPeers() on kad-dht for 1000 keys in a network of 50 peers.
[How to infer] Each result is a pair {NodeID, percentage of the 1000 GetClosestPeers() calls in which this node appeared among the K closest peers}.
Total peers involved in this experiment: 49
(Only the top 2 and bottom 2 are shown)
QmWJKJMGaw2474QQizyERjspkTjd8amtUMgpjBdx6bbtkc , 47.4 %
QmXcpUgJ1mNfQ9SJkiLNGMsBHzXdwFAYbPaAxnEPYz2WeA , 47.1 %
…
…
…
QmWATi2voJbFW2QhJ1cjVgBCezEif5sTJVuk9oLJi8m2sU , 26.499998 %
QmTALJQqCS6nLMDyWREdBzUYMnKSHnS38imG9dAfeeLJfo , 25.9 %

Mean  : 40.81632618 %
Median: 42.899998 %
SD    : 6.502569226 %
Min   : 25.9 %
Max   : 47.4 %
Ideally, I expect each node to appear in ~40.8 % of the GetClosestPeers() calls: with 49 peers involved and 20 peers returned per call, each peer should appear in 20/49 ≈ 40.8 % of the calls on average.

[B]: Distribution of peer IDs after manually taking the XOR of the hashes of 1000 keys with the hashes of 50 peer IDs.
[How to infer] Each result is a pair {NodeID, percentage of the 1000 keys for which this node appeared among the K closest peers}.
Mean  : 40.0 %
Median: 43.4 %
SD    : 10.412012293500234 %
Min   : 14.2 %
Max   : 50.2 %
I wanted to know whether this behaviour is expected, or whether there is something wrong in my understanding of how the GetClosestPeers node lookup works.

Your assumption on the distribution of randomly chosen numbers is incorrect. The numbers you are seeing appear to be as expected.

You never define what you mean by “nearly uniformly distributed on average”.

I think the following experiment will help you understand why your expectation is wrong:

Take 1000 balls and 50 bins. Throw each ball into a uniformly random bin, then observe the resulting distribution. You will see that it is not even. This is the classic balls-into-bins experiment: each bin's count is binomially distributed (approximately Poisson), and contrary to intuition, random assignment does not produce an even distribution.
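The balls-and-bins experiment takes only a few lines to run. A minimal sketch (seeded for reproducibility; the seed value is arbitrary):

```go
package main

import (
	"fmt"
	"math/rand"
)

// throwBalls throws n balls into b bins uniformly at random and
// returns the resulting count in each bin.
func throwBalls(n, b int, seed int64) []int {
	rng := rand.New(rand.NewSource(seed))
	bins := make([]int, b)
	for i := 0; i < n; i++ {
		bins[rng.Intn(b)]++ // each ball lands in a uniformly random bin
	}
	return bins
}

func main() {
	bins := throwBalls(1000, 50, 1)
	lo, hi := bins[0], bins[0]
	for _, c := range bins {
		if c < lo {
			lo = c
		}
		if c > hi {
			hi = c
		}
	}
	// The mean is exactly 20 balls per bin, but the observed
	// minimum and maximum spread well beyond that.
	fmt.Printf("min=%d max=%d (mean is 20)\n", lo, hi)
}
```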

By nearly uniform, I meant I expect each node to appear in [mean±error] of the GetClosestPeers() calls.

As a comparison with your thought experiment, 1000 balls correspond to 1000 keys, and 50 bins correspond to 50 peers(or the peerIDs). But we are not randomly throwing each key into some peer. We are choosing peers as per a distance function defined using XOR of hashed keys and peerIDs, as per the lookup-algorithm of Kademlia DHT. This is true for both [B], and [A] if my understanding of the algorithm is correct.

> By nearly uniform, I meant I expect each node to appear in [mean±error] of the GetClosestPeers() calls.

FYI, this means you allow no samples outside of one standard deviation. This is not what we expect, because the requirement is unreasonably strong. In fact, there are theorems showing that no random sampling process with independent choices (keys and peer IDs are independently random) can attain this property.

> As a comparison with your thought experiment, 1000 balls correspond to 1000 keys, and 50 bins correspond to 50 peers (or the peer IDs). But we are not randomly throwing each key into some peer. We are choosing peers as per a distance function defined using XOR of hashed keys and peer IDs, as per the lookup algorithm of the Kademlia DHT. This is true for both [B], and [A] if my understanding of the algorithm is correct.

There is an equivalence between the simple experiment I proposed and the more complex-looking algorithm of assigning keys to their closest peers. However, the equivalence is not obvious and does not fit in an email (it requires quite some mathematical machinery to explain).

I simply wanted to show you that what you expect does not hold even for a basic uniform random assignment (and it holds even less for XOR-proximity bucketing).
But there is another way to convince yourself that your expectation is unreasonable: simulate the actual algorithm itself in a short script.

1. Pick 50 random numbers (ids).
2. Pick 1000 random numbers (keys) and assign each to the XOR-closest id.
3. Check whether the resulting distribution meets the condition (above) you expect.
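The suggested simulation might look like the following sketch. It uses random 64-bit integers for both ids and keys (a simplification: real Kademlia operates on 256-bit hashes, but the statistics are analogous), and assigns each key to its single XOR-closest id:

```go
package main

import (
	"fmt"
	"math/rand"
)

// assignKeys picks nIDs random ids and nKeys random keys, assigns each
// key to the XOR-closest id, and returns how many keys each id received.
func assignKeys(nIDs, nKeys int, seed int64) []int {
	rng := rand.New(rand.NewSource(seed))
	ids := make([]uint64, nIDs)
	for i := range ids {
		ids[i] = rng.Uint64() // step 1: random ids
	}
	counts := make([]int, nIDs)
	for k := 0; k < nKeys; k++ {
		key := rng.Uint64() // step 2: random keys
		best, bestDist := 0, ^uint64(0)
		for i, id := range ids {
			if d := key ^ id; d < bestDist {
				best, bestDist = i, d // smaller XOR = closer
			}
		}
		counts[best]++
	}
	return counts
}

func main() {
	counts := assignKeys(50, 1000, 42)
	lo, hi := counts[0], counts[0]
	for _, c := range counts {
		if c < lo {
			lo = c
		}
		if c > hi {
			hi = c
		}
	}
	// Step 3: the mean is 20 keys per id, but the spread is wide,
	// matching the skew observed in experiments [A] and [B].
	fmt.Printf("min=%d max=%d (mean is 20)\n", lo, hi)
}
```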

This should convince you that your expectation is unreasonable, without having to work through the complex mathematics of why.