Indirect content addressing for DHT privacy

Problem: Network surveillance nodes can observe DHT requests for content, and fetch the data themselves, thus mapping PeerID -> File.

Currently, the IPFS DHT stores HASH(content) -> File
The idea is to instead use HASH(HASH(content)) -> File, where a node sends File iff it receives HASH(PeerID | HASH(content)) from the requesting party.
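To make the three values concrete, here is a minimal sketch of how they could be derived. It assumes SHA-256 as HASH and reads "|" as plain byte concatenation; neither is fixed by the proposal, and the function names and the peer ID bytes are hypothetical.

```python
import hashlib


def h(data: bytes) -> bytes:
    # Stand-in for HASH; SHA-256 is an assumption, not part of the proposal.
    return hashlib.sha256(data).digest()


def content_hash(content: bytes) -> bytes:
    # HASH(content): the value carried in the CID.
    return h(content)


def publish_key(content: bytes) -> bytes:
    # HASH(HASH(content)): the only value announced to the DHT.
    return h(content_hash(content))


def request_token(peer_id: bytes, content: bytes) -> bytes:
    # HASH(PeerID | HASH(content)), with "|" read as concatenation (assumption).
    return h(peer_id + content_hash(content))


content = b"hello ipfs"
peer_id = b"peer-A"  # hypothetical peer ID bytes

print("publish key:  ", publish_key(content).hex())
print("request token:", request_token(peer_id, content).hex())
```

The publish key can be computed from the CID, but the CID cannot be recovered from the publish key without inverting the hash.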

CIDs remain the same <…><HASH(content)>, but are flagged (e.g. custom multi-hash) for publishing only HASH(HASH(content)) to the DHT.

Edit: Or perhaps better as CIDv2 spec (as opposed to bitmasking multi-codec or multi-hash)

I’m not sure I follow everything, but I’m not sure it will solve the problem.

Let’s say peerA publishes a file. It will announce HASH(HASH(content)) to the DHT.
Then peerB looks for the file on the DHT. The DHT says that peerA has it. PeerA is deanonymized already.
Then peerB sends HASH(PeerID_A | HASH(content)) or HASH(PeerID_B | HASH(content)) and gets the file.

Am I misunderstanding something?

This change specifically protects against network observation of content. e.g. a nation state that monitors the network.

Scenario 1
Eve has a malicious IPFS node that always responds ‘yes’ to queries for content, but then itself just fetches the data in real-time, storing a copy along the way.

Scenario 2
Eve passively gathers IPFS file hash requests from peers, and downloads a copy in parallel.

This solution works because both parties know HASH(content) (from the CID) but Eve does not.
Eve does not know HASH(content) and therefore can never generate HASH(PeerID | HASH(content)), which is required to download the file from a peer.
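A rough sketch of the provider-side check may help here. It assumes the PeerID in the token is the requester's and that the provider learns it from the connection; the thread leaves both points open. SHA-256 and the `Provider` class are illustrative only.

```python
import hashlib
import hmac


def h(data: bytes) -> bytes:
    # Stand-in for HASH; SHA-256 is an assumption.
    return hashlib.sha256(data).digest()


class Provider:
    """Hypothetical provider holding one file and checking request tokens."""

    def __init__(self, content: bytes):
        self.content = content
        self.content_hash = h(content)

    def serve(self, requester_id: bytes, token: bytes):
        # Recompute HASH(PeerID | HASH(content)) for the requesting peer.
        expected = h(requester_id + self.content_hash)
        if hmac.compare_digest(token, expected):
            return self.content
        return None


content = b"secret report"
alice = Provider(content)

# Carol knows HASH(content) from the CID, so she can build a valid token.
carol_token = h(b"carol" + h(content))

# Eve only saw the publish key HASH(HASH(content)) on the DHT, so the best
# she can do is plug that in, which yields a different token.
eve_token = h(b"eve" + h(h(content)))

print(alice.serve(b"carol", carol_token) is not None)
print(alice.serve(b"eve", eve_token) is not None)
```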

Edit: It is of course essential that IPFS not leak CIDs during requests for content.

I’m not sure how this is malicious.

This will break discoverability. How would an honest node learn about a new CID and where to fetch it? So in your model, an honest peer can only fetch a file if they know beforehand both the CID and the peer, right?
In that case, your solution is easy to implement: just make the “publishing” peer not announce its file on the DHT (or anywhere else). Peers who know the CID and the peerID of the “publisher” will contact it directly if they want the file.

Do you want your solution to run as a replacement of the current model, or in parallel?

I see. A sort of ‘CDN node’ which caches all content requests. That’s a fair point. Seems like a tradeoff then, between the value of network content privacy and efficient cache nodes.
I can see favoring caching in the average case, and the value from caching exceeding the value of privacy on average as well.
Perhaps this could be an opt-in feature instead.

An honest node, once they know the CID (from a side channel), can query the DHT with HASH(HASH(content)) for a peer.

Edit: hmm. I’m not sure I’m convinced that (non-explicit) caching nodes will provide significant network value.

By default, IPFS nodes cache what they fetch up to their storage limit. They don’t actively fetch content, though.

Actually, yes, they do. Protocol Labs is setting up “prefetch nodes” and “Hydra nodes” to accelerate the network. Even if the impact was minimal, it wouldn’t be “malicious” as they improve speed and replication.

Yes I think so.

So is the following workflow what you had in mind?
The publishing node (Alice) will publish HASH(HASH(content)) (aka “publishing key”) and tell the DHT: I will serve the corresponding file if someone gives me the CID (aka Hash(content)).
Now Bob has a peerID close to the publishing key. He is responsible for that record.
Now Carol wants to fetch the file. The DHT gives her the address of Bob. I guess Bob remembered Alice’s peerID and gives it to Carol.
Carol gives Alice the CID (that she learned out of band). Alice sends the file to Carol. Carol advertises in the DHT that she serves “publishing key”. The DHT tells Bob that Carol is now another provider for the content.

Is that correct?
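The workflow above can be sketched as a toy, in-memory model. This is only an illustration: a dict stands in for the provider record Bob keeps, SHA-256 stands in for HASH, and `publish`, `announce` and `fetch` are hypothetical names, not IPFS APIs. The requester hands over the bare content hash, as in the description.

```python
import hashlib


def h(data: bytes) -> bytes:
    # Stand-in for HASH; SHA-256 is an assumption.
    return hashlib.sha256(data).digest()


dht = {}    # publishing key -> set of provider peer IDs (Bob's record)
store = {}  # peer ID -> {content hash: content}


def announce(peer_id: bytes, content_hash: bytes) -> None:
    # Only HASH(HASH(content)) ever reaches the DHT.
    dht.setdefault(h(content_hash), set()).add(peer_id)


def publish(peer_id: bytes, content: bytes) -> None:
    store.setdefault(peer_id, {})[h(content)] = content
    announce(peer_id, h(content))


def fetch(requester_id: bytes, content_hash: bytes):
    # Look up providers under the publishing key, then present the content hash.
    for provider_id in sorted(dht.get(h(content_hash), ())):
        content = store[provider_id].get(content_hash)
        if content is not None:
            # The requester keeps a copy and becomes another provider.
            store.setdefault(requester_id, {})[content_hash] = content
            announce(requester_id, content_hash)
            return content
    return None


publish(b"alice", b"secret report")
cid_hash = h(b"secret report")  # learned by Carol out of band

print(fetch(b"carol", cid_hash))
print(sorted(dht[h(cid_hash)]))
```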

In that case, there is an attack for Eve.
Eve listens to the DHT traffic and sees Alice announcing being a provider if given the CID for “publish key”.
Eve spawns a new identity whose peerID is close to “publish key”, so the DHT makes her responsible for that record. Call it EveDHT. Eve also spawns another identity, EveProvider.
Carol asks the DHT and finds EveDHT. Carol asks for the peerId of a provider.
EveDHT tells Carol that EveProvider has it. Carol sends the CID to EveProvider, hoping to get the content. EveProvider computes Hash(CID) and finds “publish key”! Success!

Now Eve can track this content as any other.
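The attack can be illustrated with a few lines, assuming (as in the workflow above) that the requester sends the bare content hash to whoever claims to be a provider. SHA-256 and all names here are illustrative assumptions.

```python
import hashlib


def h(data: bytes) -> bytes:
    # Stand-in for HASH; SHA-256 is an assumption.
    return hashlib.sha256(data).digest()


content = b"secret report"
cid_hash = h(content)                # known to Alice and Carol only
observed_publish_key = h(cid_hash)   # Eve saw this announced on the DHT


def alice_serve(presented: bytes):
    # Alice serves anyone who presents the bare content hash.
    return content if presented == cid_hash else None


# EveDHT points Carol at EveProvider, and Carol sends the bare content hash.
received_by_eve = cid_hash

# Eve confirms it is the preimage of the publish key she observed.
eve_learned_cid = h(received_by_eve) == observed_publish_key

# Eve can now present the same value to Alice and get the file.
stolen = alice_serve(received_by_eve)

print(eve_learned_cid, stolen == content)
```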

Having the requesting node send HASH(PeerID | HASH(content)) and not only HASH(content) doesn’t change anything: Eve can ask Bob for Alice’s peerID beforehand.

Interesting (hydra-booster). Still, I wonder how much value is in prefetching vs. simply PeerId routing. ‘Malicious’ here has to do with intent (e.g. to collect information about individuals), obviously a caching node may not be malicious.

HASH(PeerID | HASH(content)) protects from the replay attack you described, because Eve will not be able to generate it without HASH(content). Eve needs HASH(EvePeerId | HASH(content)) or HASH(AlicePeerID | HASH(content)) to receive the file, but no peer will ever send that to Eve.
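Here is a small sketch of why a captured token is useless to Eve. It assumes the token is bound to the requester's peer ID and that the transport authenticates peer IDs, so Eve cannot connect to Alice while claiming to be Carol. Both are assumptions on my side, as are SHA-256 and the helper names.

```python
import hashlib
import hmac


def h(data: bytes) -> bytes:
    # Stand-in for HASH; SHA-256 is an assumption.
    return hashlib.sha256(data).digest()


def token(peer_id: bytes, content_hash: bytes) -> bytes:
    # HASH(PeerID | HASH(content)), with "|" read as concatenation.
    return h(peer_id + content_hash)


content = b"secret report"
content_hash = h(content)


def alice_serve(connection_peer_id: bytes, presented: bytes):
    # Alice recomputes the token for the peer at the other end of the connection.
    expected = token(connection_peer_id, content_hash)
    return content if hmac.compare_digest(presented, expected) else None


# Carol is tricked into sending her token to EveProvider.
captured = token(b"carol", content_hash)

# Eve replays it to Alice, but Alice checks it against Eve's own peer ID.
replayed = alice_serve(b"eve", captured)

# The same token works for Carol herself.
honest = alice_serve(b"carol", captured)

print(replayed is None, honest == content)
```

If the token is bound to the provider's peer ID instead, the outcome is the same: Carol would send HASH(EveProviderPeerID | HASH(content)), which does not match what Alice expects.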

I think prefetch nodes are for popular content, so regular nodes often need 1 or 2 fewer hops and connect to fast nodes rather than regular nodes.
Hydra boosters will short-circuit part of the DHT and find a provider in 2 hops (let’s say) rather than 5 for most queries.
But I don’t know about the real performance gain. I guess we will have to wait for a blog post once hydra nodes are fully operational. I know they test it with Testground, though.

You’re right, my bad.
And Carol can also advertise the content under the publish key HASH(HASH(content)) to replicate the content.
This seems to work. \o/

Related (I haven’t read it yet, and it’s quite old for IPFS): https://github.com/gpestana/notes/issues/8