Been thinking about this for a bit and now might be as good a time as any to see if any of it is useful to anyone due to the increased need/activity for more efficient data transfers in bitswap I have been seeing around.
##History
A problem that IPFS had in the past was the extreme reliance of all content discovery being routed through DHT. Not only was DHT becoming slow (~1 sec response time/request) it was also becoming a burden on the entire cluster as each request for each block for each file needed to go through many machines and network hops to obtain answers. While DHT scales well the sheer number of requests being fed through the cluster was becoming problematic.
Enters bitswap sessions.
Bitswap sessions in it simplest form first asks all connected peers to send desired blocks if they have them and if after a short timeout no connected peers send the block then the system resorts to DHT to find a peer that can.
This first pass did successfully accomplish its main goal of having a faster content discovery method that no longer relies on DHT but has however come at a massive network efficiency cost through many duplicate blocks being received and high number additional packets being sent and received.
Inherently if you ask 900 peers to provide a block, you should not be surprised when more than one response is received, especially for popular content. A alternate method is required.
##Overview
At first thought my bet for a quick fix may be to simply add a round trip by spamming an availability check, then only send the block wantlist to only one of the respondents. Below I go into more detail about this strategy and a few others.
There is 2 main models for which to gather and use information for content discovery, Gossip and Prediction. (Active and Passive)
Prediction based content discovery
- Makes guesses based on passively gathered information and behaviors. It does not create any network traffic of its own and does not require remote peers to be updated/cooperate.
- Data can be collected from DHT, Bitswap, Bitswap sessions, and Gossip.
- It will generally be difficult to implement correctly and incorrect guesses only waste time and resources, however proper implementations have the potential for massive gains.
Gossip based content discovery
- Active in nature and uses additional requests to probe for where content is, requiring remote peer cooperation.
- The goal is to create as few lightweight requests as possible to have the ability to bypass expensive and/or time sensitive requests.
- Results are based on assertions and not behavior.
- Ability to make peers accountable for bad assertions through signed responses.
- Improper implementation will lead to more resources being used on background chatter than what is saved, but like prediction can have massive gains if a successful implementation can be created.
##High level overview of strategies
###âGreat peers think alikeâ long term prediction
- Primary goal of this module is to extrapolate usage patterns and behaviors.
- Peers that often respond with blocks we are requesting may have similar browsing habits and if so will have higher chance to have future blocks.
- A first run implementation of this strategy could be a simple hit/miss ratio sorted list.
- More extravagant methods can be bolted on to improve accuracy.
###âDAG Awarenessâ short term prediction
- Similar to GPTA however unlike the former that uses relationships between user behavior, this strategy directly relies on the relationship of connected data directly due to how the underlying storage system (mirkle trees/DAG) works.
- It appears that a first run implementation is already being worked on in PR #8.
- This strategy should excel at short term predictions and require little to no âwarm upâ in comparison to GPTA. If both predictions are to be used at the same time I would priorities this then check the longer term GPTA.
###âHintFloodâ gossip
- âCan you provideâ request blasted out to many peers at once.
- Extremely similar to existing bitswap session behavior, except instead of directly asking for blocks we query for availability instead.
- Still has the same underlying problem as current bitswap sessions but efficiency is greatly improved as duplicate responses are tiny in comparison (~100 byte packet vs 256kb blocks) at the expense of an extra round trip.
- Packets per second(PPS) can still be problematic as you are still blasting ~900 systems at a time per block.
- Can be combined with prediction to first only request from a subset of peers (ex. top 50 likely to respond).
- IPLD Selectors would greatly improve efficiency, but not strictly required.
###âHotsheetâ gossip
- This is a request that can ask nodes what content they have that they believe to be popular and likely requested in the future.
- This will allow for eliminating discovery requests if the block your are looking for have been previously listed and cached.
- Hotsheets could optionally use truncated/compressed ObjectIDs to reduce payload/memory requirements at the expense of lower accuracy.
- IPLD Selectors would greatly improve efficiency, but not strictly required.
###âHintlistâ gossip
- Peers that cannot provide, but know who can likely provide blocks respond with a hint object.
- You could make Hintlists interoperable with HintFlood and Hotsheets by having HintFlood respond with a full hint object (not just an affirmation) referencing its own PeerID and storing hotsheet responses as hints.
- This can be potentially abused so keeping track of peers that offer âbadâ hints and ignoring them may need to be in the first implementation.
####Strong positive hint example
{
PeerID: âQmPeerAâŚâ
Object: âQmObjectAâŚâ
properties: {âhasâ, âall children objectsâ, âRecursiveâ, âPinnedâ}
timestamp: â1538154322â
signature: ââŚâ
}
âI have this, all of its descendants, and eagerly want to distribute itâ
####Weak positive hint example
{
PeerID: âQmPeerAâŚâ
Object: âQmObjectAâŚâ
properties: {âhasâ, âpartal children objectsâ}
timestamp: â1538154322â
signature: ââŚâ
}
âI have only this block and some blocks related to thisâ
Full implementations for these strategies will likely involve having them utilize each other which has potential to have some neat systemic results like the previously mentioned HintFlood + Hotsheet + Hintlist combo along with the final module I would like to bring up.
###âHintBrokerâ module
- This one need to be prefaced with the knowledge that properly implementing this one is likely to be a pie in the sky dream. However with that said perhaps someone can use its ideas for something else that is more possible.
- HintBroker content discovery combines all other prediction and gossip system together to create a full content discovery module in scale comparable to DHT.
- Peers that maintain high up times and large hintlists can become high volume, high accuracy content discovery providers.
- These peers effectively become info brokers, treating valid hints as a valuable that can be exchanged with other HintBrokers.
- Usage results in a positive feedback loop where random peers utilize it for hints, which gives the hintbroker more info to provide more accurate hints, which brings around more peers asking for hintsâŚ
- Hintbrokers can optionally also monitor the availability of a blocks and test if assertions are correct and peers are connectable before offering them as responses.
###Side observation
Interestingly it seems she first implementation of bitswap sessions follows a lot of key ideas and issues from the first implementation of pubsub called floodsub with the second gen being a much more efficient variant called wispersub. Perhaps the solution for converting floodsub to wispersub might end up being similar to convert floodswap to wisperswap.
Some refs
a few others that I cannot find anymoreâŚ