I think you touched on and described many important aspects. I am convinced that, as a general model, yours is quite sufficient.
There are Crawlers that index IPFS or other crawlers however they see fit and publish their results to whoever wants to listen.
Then there are Searchers that choose which kinds of Crawlers they want to trust and fetch their index files
to perform searches.
But I think there should also be a low-cost way of establishing pools of crawlers
that work together on an index under certain guidelines, rules, etc.
For example, a crawler pool for general news crawlers.
In a pool, there are common guidelines and maybe even shared programs for the crawlers to adhere to.
A pool has a single index, which I will call INDEX for now.
If a crawler in the pool wants to propose adding something to or removing something from the INDEX, the change is verified and voted on by the members of the pool according to their own rules. Only after that do they update the INDEX.
Through this, broker crawlers won’t have to keep a connection to every “General News Crawler”, but can instead just fetch the current version of the pool’s INDEX.
Through this, small crawlers can still be utilized, and new crawlers have an entry point into the system of trust that the brokers or searchers establish.
Otherwise, I think that instead of proposing an actual protocol between crawlers and searchers, we should use IPNS and define what kinds of files should be published via IPNS in order to be read by other searchers and crawlers.
For example: we could say that every crawler should publish the current version of their index under their IPNS name, at a path like /index.
But if there is such a need, as you say there is and as I can imagine there to be, then I am asking:
What kind of INDEX File Format should be used?
Should there even be a defined standard?
What kinds of search index should be supported?
(https://en.wikipedia.org/wiki/Search_engine_indexing)
There is a clear benefit to providing such a standard:
the interaction between different crawler/searcher communities is not
complicated.
But the problem is that such a standard is really hard to achieve.
Looks like a good model. I would organize the pool as follows (mostly summing up your design).
Running crawlers:
Run a few initial crawlers for bootstrapping purposes.
They connect to each other and subscribe to a common topic: generalNewsCrawlers/v1.
They run a common OrbitDB in CRDT mode
They add “records” (let’s call them that) to the DB (a.k.a. the pool INDEX), along with their signature and metadata (maybe the date, a provider (to shorten lookups for searchers), possibly the IPNS record or DNSLink, the size, a guess at the file type, etc.).
To verify others’ work, they could take the records of others, reindex them, and compare the result. If it is the same, they add their signature to the record in the index and increase the confidence they have in that fellow crawler. If not, they broadcast a message on PubSub: "Crawler X made a mistake on CID Y. It did Z when we expected W. See their signature of the faulty work: S. Here is my own signature for this alert broadcast."
Upon receiving the alert message, the other crawlers check who is at fault (the bad indexer, or the whistle-blower?) and either relay the message or drop it and bring shame on the faulty whistle-blower instead. The faulty node is penalized, and the honest one is rewarded (increased confidence).
Below a certain confidence threshold, honest nodes drop the bad crawler from their local routing tables.
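The record-plus-signature flow above can be sketched in a few lines of Python. This is only an illustration of the data shape, not real code: `sign` uses an HMAC as a stand-in for a proper keypair signature (an actual OrbitDB entry would be signed with the node's key), and `fake_index` is a toy indexing function.

```python
import hashlib
import hmac
import json

# Hypothetical stand-in for a real keypair signature (e.g. Ed25519).
def sign(secret, body):
    payload = json.dumps(body, sort_keys=True).encode()
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()

def make_record(crawler_id, secret, cid, keywords, size):
    # A "record" as described: CID, indexed keywords, metadata, signature(s).
    body = {"cid": cid, "keywords": sorted(keywords), "size": size}
    return {"body": body, "signatures": {crawler_id: sign(secret, body)}}

def verify_record(record, my_id, my_secret, reindex):
    """Reindex the CID ourselves and compare: co-sign on match, alert on mismatch."""
    expected = sorted(reindex(record["body"]["cid"]))
    if expected == record["body"]["keywords"]:
        record["signatures"][my_id] = sign(my_secret, record["body"])
        return record
    return f"ALERT: mismatch on CID {record['body']['cid']}"

# Usage: crawler A indexes, crawler B re-checks and co-signs.
fake_index = lambda cid: ["news", "ipfs"]   # toy indexing function
rec = make_record("crawlerA", b"keyA", "QmFoo", ["ipfs", "news"], 1234)
rec = verify_record(rec, "crawlerB", b"keyB", fake_index)
```

A real alert would of course carry the faulty signature S and the whistle-blower's own signature, as described above.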
Searchers:
Searchers connect to one or several crawlers. They do not subscribe to generalNewsCrawlers/v1, which is only for crawlers (if they do, they won’t be penalized but will quickly be dropped, as crawlers prefer keeping connections with honest crawlers rather than cute but non-indexing searchers).
They query one or several crawlers and get results back. They should receive almost the same results from all crawlers (not exactly the same, as the crawlers are always indexing). These results carry different signatures from different crawlers. Searchers compute the union of the results.
The searcher then filters and orders them as it sees fit. Possible criteria to combine: the order proposed by the crawler, the number of signatures (a more trusted result), the date the result was first seen, different weights for the signatures of different crawlers, etc.
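The union-and-rank step could be sketched as follows, assuming (purely for illustration) that each crawler's response is just a list of CIDs and that each crawler can be given a trust weight:

```python
from collections import defaultdict

def merge_results(responses, crawler_weights=None):
    """Union the results from several crawlers; score each CID by the
    (optionally weighted) number of distinct crawlers that returned it."""
    crawler_weights = crawler_weights or {}
    scores = defaultdict(float)
    for crawler_id, cids in responses.items():
        w = crawler_weights.get(crawler_id, 1.0)
        for cid in cids:
            scores[cid] += w
    # Highest score (most independent confirmations) first.
    return sorted(scores, key=lambda cid: -scores[cid])

responses = {
    "crawlerA": ["QmX", "QmY"],
    "crawlerB": ["QmY", "QmZ"],
}
ranking = merge_results(responses)  # QmY is backed by two crawlers
```

Other criteria from the list above (date first seen, crawler-proposed order) would just be extra terms in the score.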
Crawler joining the pool:
Jonny (a joining crawler) connects to some crawlers of the pool. He enters a trial period and seeks the approval of a “senior” crawler who has successfully passed this trial and is in the pool.
Jonny asks a senior crawler for CIDs to index.
Jonny indexes them but doesn’t publish them to the INDEX. He sends them to the senior.
The senior crawler checks (some of) the results. If they are good, it signs them, puts the record with both signatures on the INDEX, and finally gossips on the PubSub channel about Good Jonny wanting to join. If Jonny tried to publish to the INDEX himself, or if his results do not follow the pool standard, they are rejected, Jonny is badly gossiped about, and he has to start from scratch again after a backoff period. This backoff is global, as all seniors know when Jonny was bitten by his senior. If Jonny tries to make his senior check a file that is too big, he is punished too, as he tried to DoS a senior. The senior informed him of the maximum size, or it’s written in the pool’s rules.
Senior challenges Jonny with bigger and bigger files.
The more Jonny delivers, the more he is trusted by his peers. If he fails once, he starts from scratch.
After a lot of successful runs, he becomes a regular crawler and publishes himself to the INDEX. The senior sends Jonny his diploma: a message saying that Jonny has now graduated, thanks to the senior, plus the diploma of the senior. This is a chain of trust that can be traced back to the original bootstrappers.
Jonny can become a senior for a new Jonny, too.
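The diploma chain can be checked with a simple walk back to the bootstrappers. A minimal sketch; the field names (`graduate`, `senior_diploma`) are made up for illustration, and a real diploma would also carry the signatures of both parties:

```python
def verify_diploma_chain(diploma, bootstrappers):
    """Follow the chain of 'graduated thanks to' links; valid only if it
    terminates at one of the original bootstrappers, with no cycles."""
    seen = set()
    current = diploma
    while current is not None:
        if current["graduate"] in seen:      # a cycle means a forged chain
            return False
        seen.add(current["graduate"])
        if current["graduate"] in bootstrappers:
            return True
        current = current.get("senior_diploma")
    return False

# Usage: crawler0 bootstrapped the pool, jonny1 graduated under it,
# jonny2 graduated under jonny1.
boot = {"crawler0"}
d0 = {"graduate": "crawler0", "senior_diploma": None}
d1 = {"graduate": "jonny1", "senior_diploma": d0}
d2 = {"graduate": "jonny2", "senior_diploma": d1}
```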
Bootstrapping the pool:
Each peer wants to be a Jonny.
After some time, they see that they haven’t found any senior crawler.
They say to N other Jonnys: "I didn’t find a senior crawler to send me tasks. Do you want to be my senior, and I’ll be yours?"
The other Jonny either says “yes”, or “I found a senior at this address. You should contact them, or try again with me in X time (I hope to have graduated by then).”
There is a possibility that several groups consolidate on their own, with a risk of a netsplit. To avoid that, seniors in each group can vet each other following the same process: they are Jonnys in each other’s network. After successful vetting on both sides, their two INDEXes and their two networks should eventually merge.
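The eventual merge of two pool INDEXes can be modeled as a grow-only CRDT merge, which is roughly what OrbitDB's CRDT mode converges to anyway. A minimal sketch, assuming records are keyed by CID with keyword and signature sets:

```python
def merge_indexes(a, b):
    """Grow-only merge: union of records keyed by CID, and a union of
    keywords and signatures per record. Commutative and idempotent, so
    both sides converge regardless of merge order."""
    merged = {}
    for index in (a, b):
        for cid, record in index.items():
            if cid not in merged:
                merged[cid] = {"keywords": set(record["keywords"]),
                               "signatures": set(record["signatures"])}
            else:
                merged[cid]["keywords"] |= set(record["keywords"])
                merged[cid]["signatures"] |= set(record["signatures"])
    return merged

# Usage: two pools that vetted each other merge their INDEXes.
pool_a = {"QmX": {"keywords": ["news"], "signatures": ["sigA1"]}}
pool_b = {"QmX": {"keywords": ["ipfs"], "signatures": ["sigB1"]},
          "QmY": {"keywords": ["mars"], "signatures": ["sigB2"]}}
merged = merge_indexes(pool_a, pool_b)
```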
Open problems:
Depending on available resources, the trust level within the pool, and the pool size, crawlers may want to check only a fraction of the records.
We may want to check newer crawlers with a low score more often, to evict bad ones faster, and not spend resources checking old, honest, reliable nodes. However, this is a DoS vector (Sybil nodes spawning rapidly, joining, and making honest nodes check their junk; getting evicted; rinse, repeat).
Bad crawlers have signed some records on the INDEX. How do we make the Searchers not trust these results?
– Revoke access to the INDEX and delete their record. How? And we lose information about bad peers.
– Make crawlers not send the records that were rejected.
– Make crawlers remember bad crawlers and not send the records that were sent by rejected crawlers (unless verified by someone else)
------ They can have a parallel OrbitDB which is a list of bad peers, along with the proof(s) of their faultiness.
How do we prevent Jonny the Joiner from indexing useless data he generated himself and presenting that as results to the crawlers? He would quickly earn the trust of the pool and be able to DoS it.
– Maybe make the senior send him some work to do first.
------ How do crawlers know that a message they just received, saying a distant honest senior node trusted a distant Jonny with a result, is legit? This unknown distant senior may be a Sybil.
---------- Should all crawlers check Jonny’s results before trusting him? That increases the DoS vector and doesn’t scale well under high churn.
---------- Alternatively, should they check that Jonny’s senior(s) was vetted by another crawler that was vetted by themselves (i.e. find a web-of-trust path going from Jonny to the skeptical crawler)? Then the web of trust would have to be stored on yet another OrbitDB, or be redundant enough to compute on the fly by jumping from node to node. But long-range attacks could infiltrate good nodes that then introduce bad peers and claim to trust them.
---------- Alternatively, we trust Jonny by default, BUT we test our fellow crawlers regularly. If Jonny fails, we decrease our confidence in Jonny by 1, and in his senior by 0.5 (to be tuned).
The rules of the pool will determine what is a good contribution and what is not. I guess the pool will provide an implementation to run.
Organizing a vote is tricky because of Sybil nodes’ voting power. Weighting by node “reputation” is tricky because reputation is local.
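The local confidence bookkeeping sketched above (reward honest peers, penalize a failing Jonny by 1 and his senior by 0.5, drop peers below a threshold) could look like this. All the numbers are placeholders to be tuned:

```python
class ConfidenceTable:
    """Each crawler's local view of peer confidence. Peers whose score
    falls below THRESHOLD are dropped from the local routing table."""
    THRESHOLD = 0.0  # placeholder, to be tuned

    def __init__(self):
        self.scores = {}

    def reward(self, peer, amount=1.0):
        self.scores[peer] = self.scores.get(peer, 0.0) + amount

    def penalize(self, peer, senior=None):
        # Failing peer loses 1; its senior shares the blame at half weight.
        self.scores[peer] = self.scores.get(peer, 0.0) - 1.0
        if senior is not None:
            self.scores[senior] = self.scores.get(senior, 0.0) - 0.5

    def routing_table(self):
        return {p for p, s in self.scores.items() if s >= self.THRESHOLD}

# Usage: jonny fails a spot check; his senior is also docked.
table = ConfidenceTable()
table.reward("jonny", 0.5)
table.reward("senior", 2.0)
table.penalize("jonny", senior="senior")   # jonny: -0.5, senior: 1.5
```

Note that this is purely local state, which is exactly why reputation-weighted voting is tricky: no two crawlers share the same table.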
Okay, I will stop now, this is getting out of hand.
An IPFS-based search engine already works, and everyone can take part in creating a knowledge graph. I would be very happy if the IPFS community took an active part in this.
I’m greeted with total darkness. And can it be accessed from IPFS as well?
If it can’t, then it is no different from using DuckDuckGo with site:ipfs.io in the search line.
This is more like a cyberlink knowledge graph. You must upload your files or IPFS hashes with keywords to the knowledge graph. After that, you will be able to find them in the search engine. The knowledge graph is now in the process of being filled. It already has over 100k cyberlinks.
My problem is:
Is it really more than just a search query on DuckDuckGo with the site: attribute set to an IPFS gateway?
Is it decentralized, interplanetary, independent, censorship resistant or modular?
Can you download the knowledge graph yourself and work on it independently? Can you create your own search algorithms?
Yes, yes, yes and yes. I sent you links to some docs above. Check it out =)
It is decentralized. It is interplanetary. It is independent. There is no censorship. It is modular. You can fork the client or the chain. You can build your own graph, etc
You can create your own search algorithms. You can govern the whole system via on-chain governance, etc.
What do you mean by that? Cyber uses IPFS as a DB for storing content, and it uses IPFS CIDs to create cyberlinks. However, cyber can work with any stateful or stateless protocol, as long as you can have a pair of CIDs and prove their source.
I mean: suppose that cyber.page were blocked in my country. Could I still access it, for example through IPFS,
by downloading some software or accessing some file on IPFS?
cyber.page is just a PoC reference gateway, no more. You can access the protocol via any possible client. For example, there is a TG bot, @cyberdBot, and there are Firefox and Chrome alpha extensions. We have started to work on a browser, called cyb, which is actually a personal blockchain application on top of the protocol. So there is no way to block it, unless you shut down the network.
Anyone is free to fork the client and build whatever gateway they want on top of it; they could even make it private or semi-private by filtering the front end.
It’s still early days, and it’s not on mainnet yet. If you fancy, I’d be happy to chat and tell you how it works in more detail. Or you can check out the code on GH: https://github.com/cybercongress/go-cyber (that’s the protocol repo).
I wonder if any blockchain-based system can be “interplanetary”. Up to 24 minutes of latency between Earth and Mars easily kills it (at the very least, miners/validators cannot be distributed across planets). Also, one of the super fancy goals of IPFS is to be partitionable, that is:
Info and apps function equally well in local area networks and offline. The Web is a partitionable fabric, like the internet.
This is also a reasonable requirement for an interplanetary system, and it seems to exclude current blockchain technology, which assumes the existence of a global, unpartitioned internet.
I have to agree with you, @sinkuu. That’s why it is not such a bad idea to have no global index, but rather small crawler communities
that eventually exchange their findings, but don’t stay in constant contact with everything else.
I think that my MVP crawler kind of works.
It creates a reverse index, mapping keywords to CIDs, in JSON format.
I haven’t yet gotten it to automatically publish to IPNS, because that seems to take very long.
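A minimal version of such a reverse index takes only a few lines. To be clear, this is a guess at the shape of the MVP, not its actual code; the tokenizer and JSON layout are assumptions:

```python
import json
import re

def build_reverse_index(documents):
    """documents: mapping CID -> text content.
    Returns a reverse index: keyword -> sorted list of CIDs."""
    index = {}
    for cid, text in documents.items():
        # Deduplicate words per document so each CID appears once per keyword.
        for word in set(re.findall(r"[a-z0-9]+", text.lower())):
            index.setdefault(word, []).append(cid)
    return {word: sorted(cids) for word, cids in index.items()}

# Usage: two toy documents; blob is what would be published under /index.
docs = {"QmA": "IPFS general news", "QmB": "news about crawlers"}
index = build_reverse_index(docs)
blob = json.dumps(index, sort_keys=True)
```

The slowness of IPNS publishing is a known pain point; batching updates (publishing the index CID only every N changes) is one common workaround.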
Hi! I created Cyber and would love to address your concerns.
I wonder if any blockchain-based system can be “interplanetary”. Up to 24 minutes of the latency between Earth and Mars easily kills it (miners/validators cannot be distributed at least)
You are correct that it’s likely we will not be able to sync one chain between Mars and Earth. But we should not need to. I am pretty sure the semantics on Mars will be very different from the Earth semantics.
That is why we defined a mechanism which allows proving the rank of any given CID from the Earth chain to any knowledge graph run by Martians. So you will only need to sync the ranks of anchor CIDs back and forth using some relay.
I am pretty sure that solution 5 will not be able to work without solution 4.
You can learn from Yaca that you can’t build a search engine that will be useful following a completely bottom-up utopia. The reason for this is quite straightforward: relevance has to be somehow protected from trivial Sybil attacks. You cannot achieve this without some economics. And yes, you can’t add an economic layer without a DLT, due to double spends.
Another problem with the bottom-up utopia is that, due to the inability to have a full index, such a search will never be able to answer questions better than a top-down solution.
A top-down approach need not be complex and centralized around one blockchain to rule them all. Check the whitepaper.
I am pretty sure that it is a good idea to develop the bottom-up utopia on top of the top-down one, so you can get the best of both worlds.