How are peer IDs used for the Kademlia XOR metric?

Hi all!

I’m trying to trace down which peers are being contacted when routing over the DHT by analyzing the output of ipfs dht query -v Qm[...]. I’m converting multihashes into their hex representation (I’ve double checked that my script performs the exact multihash → hex transformation as https://cid.ipfs.io/). I expected to see IDs (in hex) converging towards the target ID. However the hex IDs of contacted peers are all over the ID space. Apparently these IDs are further hashed before being fed to the Kademlia XOR metric. The question is …how?! :slight_smile:

As an indication of the issue, I’ve run ipfs stats dht and I’m dumping my <Bucket 9> peer IDs, which should have a common prefix of 9 bits, however they are all over the space:

75b05d7d6944a7ecd646774643f7b7241597de2680d94a0dd21e80040f26df9b
60afcee016309ee3d23fac86b2fa70834dc5df6938352588e9eff0135b92b643
69ccc36d3424b42548501e49d3ae7f697176cea4a7f8f9a08aa6333be5b3fad5
95ed9920c9937f4ba39ded253606b3dd6c195ef832ebb2e21b182fb7333250e0
9f0799f0e23178a5f622a60b5ce9e4a78a7f776b9c540bead3845ba2b639f1fe
81c4dc5c02826f76fa3297d2613beffa77580294102574c34c5d33d189a9590b
a2efe4956aceca257bd9b7f4014d036bf7b984735e771658018106c2929cd341
4203ccdccb86a411ca8cf7b8c53dc23ba8ed6a8dbf00b498b95d99b812c05fc3
d58159830b3025cef09c8b811eee568eb2a7ae03cc47c1d519a20dd3ad0a311e
75adbf6eddd8b3b4f52b12991cb4fa38b2e2f1dafa970bf71afc6b20035e7cc1
2af30eeea6d32ce286408654877ff78e93ab8af843b5f64c6fd25cb8613575ea

For sanity check, the hex IDs listed above were derived from the following multihashes produced by ipfs stats dht for my Bucket 9:
QmWG4PUzaBF8cfJ4SYgfyDJyeF4pny6BHKAHbBx737aTKQ
QmUr5LLBJkSXXFGjT4YjWbGYvSKraoo4XJb63A6RmqmpbL
QmVTecRbxZhsYv8chZpPdBWdPDW6upZq1TH41hQYopEusz
QmYRub9Pktw3UNvsJ11tcZqM55YLq6CPLNPwzosTssWQUX
QmZ3SFouJMc5Rcxg9CgPZSDjNuHuuEDJD2GJXnAKytkU6h
QmX5DQa98umB9rdMnfaHKKycPhd8yxLgZRN8mhWjVA5m1k
QmZJgvG2UeAh6FSZJBYaG8ydXGGYF8BD1y7m1xF6CC6JnL
QmSnLyi5ZCtVvJrGRTSTv3jeDwB11onm9nX2crY2pZ9PKx
Qmci5wrnsWccGNLRx8bV67WvgBjkDGFvkH97zTqPViiBpR
QmWG25DqXUM6SgRWcjiR4BoF22vCgFvHiWHFtrFr5NHtXz
QmREJo4AnuodDeusPAgFvgzT6XMu4zKASiuyNRRbRdh8uf

Any insights would be appreciated!!

Thanks,
Spyros

You might want to take a look at IPFS 0.5 Content Routing Improvements: Deep Dive | IPFS Blog & News.

The short answer to your question is that the key space being used is SHA256(data) (e.g. SHA256(PeerID), SHA256(dataMultihash), etc.) so you’re not going to be seeing convergence towards the data itself, but rather towards SHA256(data)

1 Like

Thanks for your answer! However we have tried that as well… Namely, the SHA256 hashes of each of the above hex-form hashes are the following – still all over the place rather than sharing the common first 9 bits:

2051d904e5b78878ca7d156b8c14df24d31fd973b1ce922b6d190138ed0cd7b7  (this is for 75b05d7...)
def2b884bc614c248c9f9e027458ed7491749ee0dbf4899dc49ab24464b92636  (this is for 60afcee...)
4fb60259957eced8057aab910f99b3921f98205171c4bd87b541a4ac08fa0b27  (etc.)
30965340f6d68c905d628b687ce9cf45ecd9128a97156711f73b75f4dcb57ccf
63cbda6c2d5f2b7ca0edef34d9473cb476919ddeb58c058e1bc0d5f693179606
e094aff41165fcc2206e64fee4f61b78ff8c56ccf56753863d1c815672a46e37
aa1b6eb15f0a05300337d1c367c41eb20dd3b031420c84ab1e3c863eb11beb2d
1ad2000c2501e9cf03eeb5d254341d67c866a040368892591573d1bf5f74a866
d04b000c7794ed61d164018fe9f7f8a0088ad55d4b5952af75fac0409702b6a7
a08300c5e10ee07dc6774a9537663c3c92f398130084fa6f022a7d22950d7edd
0ab6be7cc68642c47591d218b872a442636baef8e491fc89aca11ffae8489de2

Please note we are hashing the binary version of the original hashes, i.e., the byte arrays (in Python), not the string hex representation (which we tried too, just in case, but thankfully it was not the case :slight_smile: )

Our Python script to get the direct hex representations of the multihashes is the following:

import multihash
import hashlib
str = 'QmWG4PUzaBF8cfJ4SYgfyDJyeF4pny6BHKAHbBx737aTKQ'  # first multihash in original post
bytes = str.encode('utf-8')
mh = multihash.decode(bytes, 'base58')
print(mh.digest.hex())
# --> This prints '75b05d7d6944a7ecd646774643f7b7241597de2680d94a0dd21e80040f26df9b'

Our script to get the SHA256 versions is the following:

import multihash
import hashlib
str = 'QmWG4PUzaBF8cfJ4SYgfyDJyeF4pny6BHKAHbBx737aTKQ'
bytes = str.encode('utf-8')
mh = multihash.decode(bytes, 'base58')
sh = hashlib.sha256(mh.digest)
print(sh.digest().hex())
# --> This prints '2051d904e5b78878ca7d156b8c14df24d31fd973b1ce922b6d190138ed0cd7b7'

Thanks in advance!
-Spyros

Try hashing the multihash instead of the digest itself. As you can see in GitHub - multiformats/multihash: Self describing hashes - for future proofing there are a few bytes before the hash digest to describe the hash function and the digest length.

So instead of doing hashlib.sha256(mh.digest) hash the full bytes of the multihash (not sure how py-multihash deals with that).