Bitswap efficiency brainstorm

I've been thinking about this for a while, and given the increased need for, and activity around, more efficient data transfers in bitswap that I have been seeing lately, now seems as good a time as any to find out whether any of it is useful to anyone.

## History
A problem that IPFS had in the past was its extreme reliance on routing all content discovery through the DHT. Not only was the DHT becoming slow (~1 second response time per request), it was also becoming a burden on the entire network, as each request for each block of each file had to pass through many machines and network hops to obtain an answer. While the DHT scales well, the sheer number of requests being fed through it was becoming problematic.

Enter bitswap sessions.

In their simplest form, bitswap sessions first ask all connected peers to send the desired blocks if they have them; if no connected peer sends the block within a short timeout, the system falls back to the DHT to find a peer that can.

This first pass did accomplish its main goal of a faster content discovery method that no longer relies on the DHT, but it came at a massive cost in network efficiency: many duplicate blocks are received, and a high number of additional packets are sent and received.

Inherently, if you ask 900 peers to provide a block, you should not be surprised when more than one response arrives, especially for popular content. An alternative method is required.


## Overview
My first thought for a quick fix would be to simply add a round trip: blast out an availability check, then send the block wantlist to only one of the respondents. Below I go into more detail about this strategy and a few others.

There are two main models for gathering and using content discovery information: gossip and prediction (active and passive, respectively).

Prediction-based content discovery

  • Makes guesses based on passively gathered information and behaviors. It creates no network traffic of its own and does not require remote peers to be updated or to cooperate.
  • Data can be collected from the DHT, bitswap, bitswap sessions, and gossip.
  • It will generally be difficult to implement correctly, and incorrect guesses waste time and resources; however, a proper implementation has the potential for massive gains.

Gossip-based content discovery

  • Active in nature: uses additional requests to probe for where content is, requiring remote peer cooperation.
  • The goal is to issue as few lightweight requests as possible in order to bypass expensive and/or time-sensitive requests.
  • Results are based on assertions, not behavior.
  • Peers can be held accountable for bad assertions through signed responses.
  • An improper implementation will spend more resources on background chatter than it saves, but like prediction it can have massive gains if implemented successfully.

## High-level overview of strategies

### “Great peers think alike” (GPTA) long-term prediction

  • The primary goal of this module is to extrapolate usage patterns and behaviors.
  • Peers that often respond with the blocks we request may have similar browsing habits, and if so they have a higher chance of having the blocks we want in the future.
  • A first-run implementation of this strategy could be a simple list of peers sorted by hit/miss ratio (a rough sketch follows this list).
  • More extravagant methods can be bolted on to improve accuracy.
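
As a rough illustration of how such a ranking might look (all names here are invented for illustration; this is a sketch, not the existing bitswap API):

```go
package main

import (
	"fmt"
	"sort"
)

// stats tracks, per peer, how often it answered our block requests.
type stats struct {
	hits, misses int
}

func (s stats) ratio() float64 {
	total := s.hits + s.misses
	if total == 0 {
		return 0
	}
	return float64(s.hits) / float64(total)
}

// rankPeers returns peer IDs sorted by hit ratio, best first:
// peers that often had what we wanted get asked first next time.
func rankPeers(peers map[string]stats) []string {
	ids := make([]string, 0, len(peers))
	for id := range peers {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(i, j int) bool {
		return peers[ids[i]].ratio() > peers[ids[j]].ratio()
	})
	return ids
}

func main() {
	peers := map[string]stats{
		"QmPeerA": {hits: 40, misses: 10}, // 0.8 hit ratio
		"QmPeerB": {hits: 5, misses: 45},  // 0.1 hit ratio
	}
	fmt.Println(rankPeers(peers)) // [QmPeerA QmPeerB]
}
```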

### “DAG Awareness” short-term prediction

  • Similar to GPTA, but where the former uses relationships between user behaviors, this strategy relies directly on the relationships between connected data, owing to how the underlying storage system (Merkle trees/DAGs) works.
  • It appears that a first-run implementation is already being worked on in PR #8.
  • This strategy should excel at short-term predictions and require little to no ‘warm up’ compared to GPTA. If both predictions are used at the same time, I would prioritize this one, then check the longer-term GPTA. (A rough sketch of the idea follows this list.)
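
To make the idea concrete, a minimal sketch (types invented for illustration; presumably PR #8 does something more sophisticated): when a peer sends us a block, predict that the same peer also has that block's children in the DAG.

```go
// Hypothetical DAG-aware candidate tracking, not the real bitswap API.
package main

import "fmt"

type Node struct {
	CID   string
	Links []string // child CIDs in the Merkle DAG
}

// candidates maps a CID to peers predicted to have it.
var candidates = map[string][]string{}

// onBlockReceived records that `peer` sent us `n`; since peers that
// hold a node frequently hold its children too, predict the same
// peer for every link.
func onBlockReceived(peer string, n Node) {
	for _, child := range n.Links {
		candidates[child] = append(candidates[child], peer)
	}
}

func main() {
	onBlockReceived("QmPeerA", Node{
		CID:   "QmRoot",
		Links: []string{"QmChild1", "QmChild2"},
	})
	fmt.Println(candidates["QmChild1"]) // [QmPeerA]
}
```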

### “HintFlood” gossip

  • A ‘can you provide?’ request blasted out to many peers at once.
  • Extremely similar to existing bitswap session behavior, except that instead of directly asking for blocks, we query for availability (see the sketch after this list).
  • Still has the same underlying problem as current bitswap sessions, but efficiency is greatly improved since duplicate responses are tiny by comparison (~100-byte packets vs. 256 KB blocks), at the expense of an extra round trip.
  • Packets per second (PPS) can still be problematic, as you are still blasting ~900 systems at a time per block.
  • Can be combined with prediction to request from only a subset of peers first (e.g. the top 50 most likely to respond).
  • IPLD Selectors would greatly improve efficiency, but are not strictly required.
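
A rough sketch of the two-phase flow, with channels standing in for the network layer (none of these names are real bitswap calls):

```go
package main

import (
	"fmt"
	"time"
)

// askAvailability stands in for the tiny 'can you provide?' probe;
// here each peer answers on a shared channel after a fake delay.
func askAvailability(peers []string, cid string, replies chan<- string) {
	for _, p := range peers {
		go func(p string) {
			time.Sleep(10 * time.Millisecond) // pretend network latency
			replies <- p                      // pretend every peer has the block
		}(p)
	}
}

// fetch blasts the availability probe to all peers, then sends the
// actual (256 KB-scale) block request to only the first responder.
func fetch(peers []string, cid string) {
	replies := make(chan string, len(peers))
	askAvailability(peers, cid, replies)

	select {
	case p := <-replies:
		// duplicate availability replies cost ~100 bytes each,
		// instead of duplicate 256 KB blocks
		fmt.Printf("requesting %s from %s only\n", cid, p)
	case <-time.After(500 * time.Millisecond):
		fmt.Println("no peer answered, falling back to the DHT")
	}
}

func main() {
	fetch([]string{"QmPeerA", "QmPeerB", "QmPeerC"}, "QmObjectA")
}
```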

### “Hotsheet” gossip

  • A request asking nodes what content they have that they believe to be popular and likely to be requested in the future.
  • This allows a discovery request to be skipped entirely when the block you are looking for has already been listed and cached.
  • Hotsheets could optionally use truncated/compressed ObjectIDs to reduce payload/memory requirements at the expense of lower accuracy.
  • IPLD Selectors would greatly improve efficiency, but are not strictly required.

### “Hintlist” gossip

  • Peers that cannot provide blocks, but know who likely can, respond with a hint object.
  • Hintlists could be made interoperable with HintFlood and Hotsheets by having HintFlood respond with a full hint object (not just an affirmation) referencing its own PeerID, and by storing hotsheet responses as hints.
  • This can potentially be abused, so keeping track of peers that offer ‘bad’ hints and ignoring them may need to be part of the first implementation.

#### Strong positive hint example

```json
{
  "PeerID": "QmPeerA…",
  "Object": "QmObjectA…",
  "properties": ["has", "all children objects", "Recursive", "Pinned"],
  "timestamp": "1538154322",
  "signature": "…"
}
```
“I have this, all of its descendants, and eagerly want to distribute it”

#### Weak positive hint example

```json
{
  "PeerID": "QmPeerA…",
  "Object": "QmObjectA…",
  "properties": ["has", "partial children objects"],
  "timestamp": "1538154322",
  "signature": "…"
}
```
“I have only this block and some blocks related to this”
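
For concreteness, a hedged sketch of what checking such a signed hint could look like: the field names mirror the JSON above, while the payload encoding and signing scheme are invented for illustration (a real protocol would use the peer's libp2p identity key and a canonical encoding such as CBOR).

```go
package main

import (
	"crypto/ed25519"
	"fmt"
)

// Hint mirrors the example objects above. The signature covers the
// other fields so peers can be held accountable for bad hints.
type Hint struct {
	PeerID     string
	Object     string
	Properties []string
	Timestamp  int64
	Signature  []byte
}

// payload is the byte string the signature is computed over;
// a real protocol would use a canonical encoding instead.
func (h Hint) payload() []byte {
	return []byte(fmt.Sprintf("%s|%s|%v|%d",
		h.PeerID, h.Object, h.Properties, h.Timestamp))
}

// verify checks the hint against the claimed peer's public key;
// bad hints can then count against that peer's reputation.
func verify(h Hint, pub ed25519.PublicKey) bool {
	return ed25519.Verify(pub, h.payload(), h.Signature)
}

func main() {
	pub, priv, _ := ed25519.GenerateKey(nil)
	h := Hint{
		PeerID:     "QmPeerA",
		Object:     "QmObjectA",
		Properties: []string{"has", "all children objects", "Pinned"},
		Timestamp:  1538154322,
	}
	h.Signature = ed25519.Sign(priv, h.payload())
	fmt.Println(verify(h, pub)) // true
}
```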

Full implementations of these strategies will likely have them utilize each other, which has the potential for some neat systemic results, like the previously mentioned HintFlood + Hotsheet + Hintlist combo, along with the final module I would like to bring up.

### “HintBroker” module

  • This one needs to be prefaced with the acknowledgement that implementing it properly is likely a pie-in-the-sky dream. With that said, perhaps someone can use its ideas for something else that is more attainable.
  • HintBroker content discovery combines all the other prediction and gossip systems to create a full content discovery module comparable in scale to the DHT.
  • Peers that maintain high uptimes and large hintlists can become high-volume, high-accuracy content discovery providers.
  • These peers effectively become info brokers, treating valid hints as a valuable commodity that can be exchanged with other HintBrokers.
  • Usage results in a positive feedback loop: random peers use it for hints, which gives the HintBroker more info to provide more accurate hints, which attracts more peers asking for hints…
  • HintBrokers can optionally also monitor the availability of blocks and test that assertions are correct and peers are connectable before offering them as responses.

### Side observation

Interestingly, it seems the first implementation of bitswap sessions follows a lot of the key ideas and issues of the first implementation of pubsub, called floodsub, with the second generation being a much more efficient variant called whispersub. Perhaps the solution for converting floodsub to whispersub might end up being similar for converting ‘floodswap’ to ‘whisperswap’.



---

First of all, thanks for asking for external input.

So, as a complete outsider, let me give a summary and some thoughts.

There needs to be a mechanism to know what is where, so that we avoid having to ask the DHT for every single block. This mechanism should not use excessive bandwidth. It also does not need to be perfect: having to go to the DHT for 0.1% of all blocks would not be a big problem.

Hashes are pretty large and impossible to compress, so any mechanism that requires sending around full hashes is probably less than optimal. The mechanism described above provides good compression for some use cases by sending around just the root of a large Merkle DAG. But what if you just know that you want QmABCD, and don’t know that it is a child of the Merkle DAG QmROOT? Then the above won’t help at all.

What does a node do to quickly figure out whether it itself has a block, without going to disk? Probably some sort of bloom filter, since storing the entire set of hashes in memory would be too expensive. So what if you could somehow gossip the bloom filters to neighbouring nodes?
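
For anyone unfamiliar with them, a bloom filter is just a bitmap plus k hash functions. A minimal sketch (sizes and hash scheme are illustrative, not tuned):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

type Bloom struct {
	bits []uint64 // bitmap, m = len(bits)*64 bits
	k    uint     // number of hash functions
}

func NewBloom(mBits uint, k uint) *Bloom {
	return &Bloom{bits: make([]uint64, (mBits+63)/64), k: k}
}

// Two FNV hashes combined with the Kirsch-Mitzenmacher trick:
// h_i(x) = h1(x) + i*h2(x), giving k indexes from two hashes.
func (b *Bloom) indexes(key []byte) []uint64 {
	h1 := fnv.New64a()
	h1.Write(key)
	a := h1.Sum64()
	h2 := fnv.New64()
	h2.Write(key)
	c := h2.Sum64() | 1 // force odd so it walks the whole bitmap
	m := uint64(len(b.bits) * 64)
	idx := make([]uint64, b.k)
	for i := uint(0); i < b.k; i++ {
		idx[i] = (a + uint64(i)*c) % m
	}
	return idx
}

func (b *Bloom) Add(key []byte) {
	for _, i := range b.indexes(key) {
		b.bits[i/64] |= 1 << (i % 64)
	}
}

// Has reports possible membership: false is definite,
// true may be a false positive.
func (b *Bloom) Has(key []byte) bool {
	for _, i := range b.indexes(key) {
		if b.bits[i/64]&(1<<(i%64)) == 0 {
			return false
		}
	}
	return true
}

func main() {
	// ~10 bits/entry for 100k entries, k=7 -> ~1% false positives.
	f := NewBloom(1_000_000, 7)
	f.Add([]byte("QmObjectA"))
	fmt.Println(f.Has([]byte("QmObjectA"))) // true
	fmt.Println(f.Has([]byte("QmObjectB"))) // almost certainly false
}
```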

If you have an incomplete bloom filter, you will get false negatives: neighbour X has block Y, but the bloom filter says it does not. No big deal, you just get it from somebody else or ask the DHT.

There will also be false positives with bloom filters. In that case you ask for a block, don’t get it, and then have to go to the DHT. This is a slightly bigger deal, since you only learn about the failure after a request times out. You might start asking the DHT or somebody else if you don’t get an answer from a candidate peer within a short time.

I don’t have the exact numbers in my head, but a bloom filter bitmap of 10 bits per entry is supposedly sufficient for a <1% false positive rate. So for a pretty big node containing 1e9 blocks, the bloom filter bitmap is roughly 1 GB. But it contains mostly zero bits, so it should be easy to compress when gossiping. And if you have an incomplete bitmap, nothing is really broken; you just get a higher false negative rate.
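
For reference, the standard approximation for the false positive rate of a bloom filter with m bits, n entries, and k hash functions is

```latex
p \approx \left(1 - e^{-kn/m}\right)^{k},
\qquad k_{\mathrm{opt}} = \frac{m}{n}\ln 2
```

so at 10 bits per entry with the optimal k = 7, p ≈ (1 − e^(−0.7))^7 ≈ 0.0082, a shade under 1%, consistent with the 10-bits-per-entry claim above.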

So what about this:

  • A node stores a bloom filter bitmap (or some other, more efficient representation) for each of its current peers and keeps them up to date via some gossip protocol.
  • When a node wants to retrieve a block, it first checks the bloom filters of all its peers and gets a list of peers that have the block. It then requests the block from each peer that has it according to the bloom filter, in a random, staggered fashion. In case nobody has it, it asks the DHT. (A lookup sketch follows this list.)
  • When it gains a new peer, it allocates a bloom filter bitmap for it and slowly fills it from gossip.
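
A minimal sketch of that lookup path (all names invented; the `fakeFilter` stand-in would be replaced by the real per-peer bloom filter from the earlier sketch):

```go
package main

import "fmt"

// membership is whatever filter representation we keep per peer;
// the Bloom type from the earlier sketch would satisfy this.
type membership interface {
	Has(key []byte) bool
}

// FilterStore holds one gossiped filter per connected peer.
type FilterStore struct {
	filters map[string]membership // peer ID -> filter
}

// CandidatePeers returns peers whose filter claims the block.
// False positives just waste one request; if the list is empty
// (or everyone times out), fall back to the DHT.
func (fs *FilterStore) CandidatePeers(cid []byte) []string {
	var out []string
	for peer, f := range fs.filters {
		if f.Has(cid) {
			out = append(out, peer)
		}
	}
	return out
}

// tiny stand-in filter so the example runs without the full bloom code
type fakeFilter map[string]bool

func (f fakeFilter) Has(key []byte) bool { return f[string(key)] }

func main() {
	fs := &FilterStore{filters: map[string]membership{
		"QmPeerA": fakeFilter{"QmObjectA": true},
		"QmPeerB": fakeFilter{},
	}}
	fmt.Println(fs.CandidatePeers([]byte("QmObjectA"))) // [QmPeerA]
}
```

The session would shuffle that candidate list and stagger requests across it, rather than blasting everyone at once.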

There might be a clever way to encode the bloom filter deltas as hashed IPFS objects themselves. E.g. an update to a bloom filter bitmap could be encoded as a sorted set of offsets of the 1 bits, which is then OR-ed onto the bitmap. But that needs some more thought.
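
The delta idea in code form, a minimal sketch assuming the update is just the offsets of the newly set bits:

```go
package main

import "fmt"

// applyDelta OR-s an update onto our local copy of a peer's bitmap.
// The delta is just the sorted offsets of bits that flipped to 1
// since the last gossip round, which stays tiny when churn is low.
func applyDelta(bitmap []uint64, setBits []uint64) {
	for _, off := range setBits {
		if off/64 < uint64(len(bitmap)) { // ignore out-of-range offsets
			bitmap[off/64] |= 1 << (off % 64)
		}
	}
}

func main() {
	bitmap := make([]uint64, 4) // 256-bit toy bitmap
	applyDelta(bitmap, []uint64{3, 64, 200})
	fmt.Printf("%b %b\n", bitmap[0], bitmap[1]) // bits 3 and 64 set
}
```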

By the way: can we get some stats on a typical large dataset? E.g. how many gigabytes and how many nodes is the Wikipedia dataset?

---

I have never heard of a bloom filter before… How have I never heard of bloom filters before??

These are amazing for just this use case. Being able to represent 100,000 blocks at a 1% false positive rate in only 117 KB? Yes please.

  • 0.1 MB filter = 83,464 hashes @ 99% accuracy
  • 1 MB filter = 834,633 hashes @ 99% accuracy
  • 10 MB filter = 8,346,324 hashes @ 99% accuracy

A tool to play around with filter sizes and probabilities:
https://hur.st/bloomfilter/?n=&p=0.01&m=1MB&k=

There are some constraints, like not being able to remove keys from the filter without rebuilding it, but for the most part they can be dealt with.

Which hashes go into the filters being passed around can be varied to match the scenario (full maps vs. only popular content vs. new hashes since the last filter, etc.).

Might be time for me to learn some Go.

---

I think this can be generalised.

A node has a set of hashes. To have a cheap mechanism to transfer that knowledge to other nodes, we need a compact representation of a set of hashes that can be partially and incrementally transferred to other nodes. A bloom filter is such a representation, but there might be others.

It is worth noting that while the set of hashes a node has might be large, it won’t change much over time. If the gossip protocol can take advantage of that, it should be efficient.

Of course, you still need local storage for the information about which peer has what, but with bloom filters and other probabilistic data structures you can easily trade space for accuracy. E.g. make the bitmap smaller and get a higher false positive rate. If you are really storage-constrained, you can just ignore the bloom filter gossip completely.