Would there be interest in an IPFS Search Engine?

Will they publish their verifications to the public? Send them to the peers they are connected to? Give them out on a request basis? Or keep them locally?

I think they should send them on a request basis.

I think there won't be a final INDEX file. I doubt we can have an eventually consistent INDEX file, one reason being that different crawlers will disagree on what a valid update is, and so will different searchers.
Every searcher will do its best to know a set of trusted crawlers, and the searcher's INDEX file will basically be the union of the INDEX files of the crawlers it knows and trusts.
Depending on the implementation, searchers could either query the crawlers for results, or fetch their Index files to build their own Index locally (faster queries, but higher maintenance costs).
(I can also imagine a third type of node in the model: Brokers/Coordinators. They would fetch Indexes from crawlers to build one big Index so that Searchers can query them. To Crawlers, they look like Searchers building their Index. To Searchers not building a local Index, they look like Crawlers with big search capacity.)
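To make the "union of trusted INDEX files" idea concrete, here is a minimal sketch (in Python) of how a searcher could merge a few crawler Index files into its own local Index, assuming the simple keyword-to-CIDs JSON shape that comes up later in this thread; none of this is a proposed standard:

```python
from collections import defaultdict

def merge_indexes(crawler_indexes):
    """Union several crawler index files into one local index.

    Each crawler index is assumed to be a dict mapping a keyword to a list
    of CIDs; this is just an illustration, not a fixed format.
    """
    merged = defaultdict(set)
    for index in crawler_indexes:
        for keyword, cids in index.items():
            merged[keyword].update(cids)
    # Back to plain lists so the result stays JSON-serializable.
    return {keyword: sorted(cids) for keyword, cids in merged.items()}

# Two trusted crawlers disagree slightly; the searcher keeps the union.
index_a = {"ipfs": ["QmAAA", "QmBBB"], "search": ["QmAAA"]}
index_b = {"ipfs": ["QmBBB", "QmCCC"]}
print(merge_indexes([index_a, index_b]))
```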

Do we? I think the model is BitTorrent here: find nodes to cooperate with, define "cooperative" for your own use-case/implementation (any node, a node seeding enough, a fast node, a node without data but useful for DHT lookups, etc.), and work with them.

I can see different implementations for Crawlers and Searchers, with different criteria, and it’s fine.

Plenty.
Let me introduce several implementations serving several use-cases:

  1. Crawlers
  • The general-purpose Crawler. It listens to DHT traffic for new files and tries to grow its index every day. Most searchers use it.
  • The project indexer. It participates in a Collaborative cluster and indexes that.
  • The skeptical lazy verifier. It listens to the queries it receives and double-checks the results of other crawlers. Searchers like to query it because this fact-checker is always up to date on hot topics and reliable (unless you are the first to query).
  • The Grammar Nazi Crawler. This guy crawls the same sources as everyone else, but it is looking for something different. For example, it indexes well-written texts and checks for syntax errors, so you can be sure to read quality posts.
  • The video Indexer Crawler. It fetches some frames and looks for cat videos. Most Searchers don't bother contacting it, but cat lovers do.
  • The Greedy Bastard. It built an annotated patent database and found similarities between some. Pretty neat, huh? But you'll need to pay a few coins to make a query :/.
  • The General News Crawler. It crawls hot topics and makes a selection of articles. They say it works closely with the Grammar Nazi Crawler and the Is This Well-Written Crawler…
  • The Times crawler. It doesn’t crawl much. But if you want a Time Magazine article, ask for it here.
  • The personal social media Crawler. It subscribes to your friends' nodes and indexes their posts. You can catch up with your friend David's latest avocado toast pictures in one query.
  • The Hosted-With-foo crawler. Foo.com lets you host your website on IPFS easily. They have a crawler only for the websites they help set up, so that on their front page they can say: "A guy did this incredible website with our tech!".
  • The Crawlers Crawler. Aka the Broker (see above). It sucks in all the other Indexes, and you can query it alone to go super fast if you want. A bit centralized, but hey, you decide.
  2. Searchers
  • General-purpose. It can query all the general-purpose indexers and most of the others.
  • Specialized. Searching scientific papers. Its queries use a super complicated and confusing format. Luckily, it is compatible with the Scientific Paper indexer.
  • The Cat video searcher looks for Cat videos. It lives in a dedicated Desktop App and queries only the Cat Video Indexer.
  • More generally, there is a specialized searcher for every specialized crawler.
  • The Overly Patriotic Searcher. It queries general-purpose indexers but filters out results from IPs too far from itself. Its nemesis, the Globe-Trotter Searcher, does the opposite.
  • The Fake News filter. It lets you search for any hot topic and serves you a random cat video instead. It’s for your own sake.

And there’re many more…

Anyway.

Where were we?

If there is no central authority AND no consensus (blockchain-enforced or protocol-enforced), you won't have it, I think. But it's fine. We're going down the Solution 5 road. There should be a standard way to communicate what type of crawler you are and what type of cooperation you are looking for, if any, but there won't be a standard way to build trust. Effective ways of communicating what you do should help other nodes avoid contacting you if they don't like how you work, though.
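Just to illustrate what "a standard way to communicate what type of crawler you are" could look like, here is a hypothetical self-description a crawler might publish (for example under its IPNS key); every field name and placeholder value below is made up for the sake of the example:

```python
import json

# Hypothetical self-description a crawler could publish so that other nodes
# can decide whether they want to cooperate with it. All field names and the
# placeholder values are invented; this is not a proposed standard.
crawler_profile = {
    "kind": "crawler",
    "specialization": "general-news",           # what it indexes
    "index_format": "keyword-to-cids/json/v0",  # how its INDEX is laid out
    "cooperation": ["answer-queries", "share-index"],  # what it offers
    "index_ipns": "/ipns/<crawler-peer-id>/index",     # where to fetch the INDEX
    "pubkey": "<base64-public-key>",            # to verify signed records
}

print(json.dumps(crawler_profile, indent=2))
```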

I suspect most searchers won’t want to have a huge Index locally. They will keep no Index locally except for cache, and verify the results, not the whole merged Index. Some will want to build a local Index without also being full-on crawlers, but they will either ask for some small specialized Indexes, or check the provided Indexes probabilistically, or just check the trusted crawlers’ signatures.
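"Check the provided Indexes probabilistically" could be as simple as sampling a handful of entries and re-verifying only those. A rough sketch, where verify_entry is whatever check the searcher trusts (e.g. fetch the CID and confirm the keyword really occurs in it):

```python
import random

def spot_check(index, verify_entry, sample_size=10, threshold=0.9):
    """Probabilistically check a crawler-provided index instead of
    re-verifying all of it.

    `verify_entry(keyword, cid)` is assumed to be supplied by the searcher.
    Returns True if enough sampled entries check out.
    """
    entries = [(kw, cid) for kw, cids in index.items() for cid in cids]
    if not entries:
        return True
    sample = random.sample(entries, min(sample_size, len(entries)))
    ok = sum(1 for kw, cid in sample if verify_entry(kw, cid))
    return ok / len(sample) >= threshold
```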

Just the way I see it.

I think you touched on and described many important aspects. I am convinced that, as a general model, yours is quite sufficient.
There are Crawlers that index IPFS or other crawlers as they see fit and publish their results to whoever wants to listen.
Then there are Searchers that choose the kinds of Crawlers they want to trust and fetch their Index files to perform searches.

But I think that there should also be a low-cost way of establishing pools of crawlers
that work together on an index under certain guidelines, rules, etc.
For example, a crawler pool for the General News Crawlers.

In a pool, there are common guidelines and maybe even programs for the crawlers to adhere to.
A pool has a single index, which I will call the INDEX for now.
If a crawler in the pool wants to propose adding or removing something from the INDEX, the change is verified and voted upon by the members of the pool according to their own rules. After that, they update the INDEX.
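As a rough illustration (not a spec), a proposal to change the pool INDEX could be as small as this; how signatures are produced and how many approvals count as "accepted" is entirely up to the pool's own rules:

```python
from dataclasses import dataclass, field

@dataclass
class Proposal:
    """A hypothetical pool proposal: add or remove one record from the INDEX."""
    action: str        # "add" or "remove"
    cid: str           # the content being (de)indexed
    keywords: list     # keywords the proposer extracted
    proposer: str      # peer ID of the proposing crawler
    signature: str     # proposer's signature over the record
    approvals: dict = field(default_factory=dict)  # peer ID -> signature

def accepted(proposal: Proposal, pool_size: int, quorum: float = 0.5) -> bool:
    # Example rule only: a simple majority of the pool must approve.
    return len(proposal.approvals) > pool_size * quorum
```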

This way, broker crawlers won't have to keep a connection to every "General News Crawler" but can instead just fetch the current version of the pool's INDEX.
This way, small crawlers can still be utilized, and new crawlers have an entry point into the system of trust that the brokers or searchers establish.

Otherwise, I think that instead of proposing an actual protocol between crawlers and searchers, we should use IPNS and define what kinds of files should be put on IPNS in order to be read by other searchers and crawlers.
For example, we could say that every crawler should publish the current version of its index under its IPNS name in a file /index, or something like this.
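With that convention, reading another crawler's index needs nothing beyond a regular IPFS node, since `ipfs cat` can resolve IPNS paths. A small sketch, assuming the index is a single JSON file published as /index under the crawler's IPNS name:

```python
import json
import subprocess

def fetch_index(crawler_peer_id):
    """Fetch another crawler's published index, assuming the "/index on IPNS"
    convention suggested above and a locally running `ipfs` daemon."""
    raw = subprocess.check_output(
        ["ipfs", "cat", f"/ipns/{crawler_peer_id}/index"]
    )
    return json.loads(raw)
```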

But if there is such a need, as you say and as I can imagine there to be, then I am asking:
What kind of INDEX File Format should be used?
Should there even be a defined standard?
What kinds of search index should be supported?
(https://en.wikipedia.org/wiki/Search_engine_indexing)
There is a clear benefit to providing such a standard: the interaction between different crawler/searcher communities becomes uncomplicated.
But the problem is that such a standard is really hard to agree on.
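For what it's worth, even a tiny bit of metadata and a version field would already make interoperability easier than raw keyword lists. One entirely hypothetical shape for a published INDEX file (none of these field names are agreed upon anywhere):

```python
# A hypothetical INDEX file layout; a version field and some metadata let
# different crawler/searcher communities interoperate without agreeing on
# everything up front. Nothing here is a standard.
example_index_file = {
    "format": "ipfs-search-index/v0",    # made-up format identifier
    "crawler": "<peer-id>",
    "created": "2020-07-01T00:00:00Z",
    "entries": {
        "avocado": ["QmAAA...", "QmBBB..."],
        "toast": ["QmAAA..."],
    },
    "signature": "<signature over the entries>",
}
```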

Looks like a good model. I would organize the pool as follows (mostly summing up your design).

Running crawlers:

  • Run a few first crawlers for bootstrapping purposes.
  • They connect to each other and subscribe to a common topic: generalNewsCrawlers/v1.
  • They run a common OrbitDB in CRDT mode.
  • They add "records" (let's call them that) to the DB (aka the pool INDEX) along with their signature and metadata (maybe a date, a provider (to shorten lookups for searchers), possibly the IPNS record or DNSLink, the size, a guess on the type of file, etc.); a minimal sketch of such a record follows this list.
  • To verify others' work, they could take the records of others, reindex them, and compare the results. If they are the same, they add their signature to the record in the index and increase their confidence in that fellow crawler. If not, they broadcast a message on PubSub: "Crawler X made a mistake on CID Y. It did Z when we expected W. See their signature of the faulty work: S. Here is my own signature for this alert broadcast."
  • Upon receiving the alert message, the other crawlers check who is faulty (the bad indexer, or the whistle-blower?) and either relay the message, or drop it and bring shame on the faulty whistle-blower instead. The faulty node is penalized, and an honest one is rewarded (increased confidence).
  • Below a certain confidence threshold, honest nodes drop the bad crawler from their local routing tables.
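A minimal sketch of what such a record and the re-verification step could look like; field names, the scoring constants, and the reindex function are all placeholders, and a real pool would broadcast the PubSub alert on a mismatch instead of just adjusting a local score:

```python
# A hypothetical pool record as it could be written to the shared OrbitDB
# (the pool INDEX). Field names are made up; the pool's rules would fix them.
record = {
    "cid": "QmSomeArticle",
    "keywords": ["ipfs", "search"],
    "provider": "<peer ID of a known provider>",
    "seen": "2020-07-01T00:00:00Z",
    "size": 14302,
    "type_guess": "text/html",
    "signatures": {"<crawler A>": "<sig>"},  # grows as peers re-verify it
}

# Purely local confidence scores, one per fellow crawler.
confidence = {}

def check_record(rec, reindex, threshold=-3):
    """Redo a peer's work and compare, as described in the list above.

    `reindex(cid)` is assumed to fetch the content and extract keywords
    following the pool's common rules. The confidence bookkeeping below is
    only a placeholder for whatever scoring the pool agrees on.
    """
    author = next(iter(rec["signatures"]))
    ok = sorted(reindex(rec["cid"])) == sorted(rec["keywords"])
    confidence[author] = confidence.get(author, 0) + (1 if ok else -1)
    if confidence[author] < threshold:
        print(f"dropping {author} from the local routing table")
    return ok
```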

Searchers:

  • Searchers connect to one or several crawlers. They do not subscribe to generalNewsCrawlers/v1, which is only for crawlers (if they do, they won't be penalized but will quickly be dropped, as crawlers prefer keeping connections with honest crawlers, not cute but non-indexing searchers).
  • They query one or several crawlers and get results back. They should get almost the same results from all crawlers (not exactly the same, as the crawlers are always indexing). These results carry different signatures from different crawlers. Searchers compute the union of the results.
  • The searcher then filters and orders as it sees fit. Possible criteria to combine: the order proposed by the crawler, the number of signatures (more signatures, more trusted result), the date it was first seen, different weights for the signatures of different crawlers, etc. A small merging/ranking sketch follows this list.
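A tiny sketch of that union-then-rank step, combining just two of the criteria above (signature count and per-crawler weights); every searcher would of course pick its own mix:

```python
def merge_and_rank(result_sets, crawler_weights=None):
    """Union results from several crawlers and rank them.

    Each result set is assumed to map a CID to the set of crawler IDs that
    signed it. The scoring rule is only an example; it is up to the searcher.
    """
    crawler_weights = crawler_weights or {}
    combined = {}
    for results in result_sets:
        for cid, signers in results.items():
            combined.setdefault(cid, set()).update(signers)

    def score(cid):
        # More signatures, and signatures from more trusted crawlers, rank higher.
        return sum(crawler_weights.get(s, 1.0) for s in combined[cid])

    return sorted(combined, key=score, reverse=True)
```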

Crawler joining the pool:

  • Jonny (joining crawler) connects to some Crawlers of the pool. He will enter a trial period and will seek the approval of a “senior” crawler who has successfully passed this trial and is in the pool.
  • Jonny asks a senior crawler for CIDs to index.
  • Jonny indexes them but doesn't publish them to the INDEX. He sends them to the senior.
  • The senior crawler checks (some of) the results. If they are good, it signs them, puts the record with both signatures on the INDEX, and finally gossips about Good Jonny wanting to join on the PubSub channel. If Jonny tried to publish to the INDEX himself, or if the results do not follow the pool standard, they are rejected, Jonny is badly gossiped about, and he has to start from scratch again after a backoff period. This backoff is global, as all seniors know when Jonny was bitten by his senior. If Jonny tries to make his senior check a file that is too big, he is punished too, as he tried to DoS a senior. The senior informed him of the maximum size, or it's written in the pool's rules.
  • The senior challenges Jonny with bigger and bigger files.
  • The more Jonny delivers, the more he is trusted by his peers. If he fails once, he starts from scratch.
  • After a lot of successful runs, he becomes a regular crawler and publishes to the INDEX himself. The senior sends Jonny his diploma: a message saying that Jonny has now graduated, thanks to the Senior, plus the diploma of the senior. This is a chain of trust that can be traced back to the original bootstrappers (a sketch of checking such a chain follows this list).
  • Jonny can become a senior for a new Jonny, too.
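Checking such a diploma chain is just walking it back until a known bootstrapper is reached. A sketch, assuming a diploma is a signed (graduate, senior) pair that carries its senior's own diploma, and that the pool implementation provides the signature check:

```python
def verify_diploma_chain(diploma, trusted_bootstrappers, verify_sig):
    """Walk a diploma chain back to the original bootstrappers.

    A diploma is assumed to look like:
        {"graduate": peer_id, "senior": peer_id,
         "signature": sig_by_senior, "senior_diploma": parent_diploma_or_None}
    `verify_sig(signer, payload, signature)` is assumed to be provided by the
    pool implementation; the exact shape here is hypothetical.
    """
    while diploma is not None:
        payload = (diploma["graduate"], diploma["senior"])
        if not verify_sig(diploma["senior"], payload, diploma["signature"]):
            return False
        if diploma["senior"] in trusted_bootstrappers:
            return True
        diploma = diploma.get("senior_diploma")
    # The chain ended without reaching a known bootstrapper.
    return False
```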

Bootstrapping the pool:

  • Each peer wants to be a Jonny.
  • After some time, they notice they haven't found any senior crawler.
  • They say to N other Jonnys: "I didn't find a senior crawler to send me tasks. Do you want to be my Senior, and I'll be yours?"
  • The other Jonny either says "yes", or "I found a Senior at this address. You should contact them, or try again with me in X time (I hope to have graduated by then)."
  • There is a possibility that several groups consolidate on their own, with a risk of netsplit. To avoid that, seniors in each group can vet each other following the same process: they are Jonnys in each other's network. After successful vetting on both sides, their two INDEXes and their two networks should eventually merge.

Open problems:

  • Depending on available resources, the trust level within the pool, and the pool size, crawlers may want to check only a fraction of the records.
  • We may want to check newer crawlers with a low score more often, to evict bad ones faster, and not spend resources checking old, honest, reliable nodes. However, this is a DoS vector (Sybil nodes spawning rapidly, joining, and making honest nodes check their crap; getting evicted, rinse, repeat).
  • Bad crawlers have signed some records on the INDEX. How do we make the Searchers not trust these results?
    – Revoke access to the INDEX and delete their records. How? And we lose information about bad peers.
    – Make crawlers not send the records that were rejected.
    – Make crawlers remember bad crawlers and not send the records that were sent by rejected crawlers (unless verified by someone else).
      – They can have a parallel OrbitDB which is a list of bad peers, along with the proof(s) of their faultiness.
  • How to prevent Jonny the Joiner from indexing useless data he generated himself and presenting that as results to the Crawlers? He would quickly earn the trust of the pool and be able to DoS it.
    – Maybe make the Senior send him some work to do first.
      – How do Crawlers know that the message they just received, saying a distant honest senior node trusted a distant Jonny with a result, is legit? This unknown distant senior may be a Sybil.
        – Should all Crawlers check Jonny's results before trusting him? That increases the DoS surface and doesn't scale well under high churn.
        – Alternatively, should they check that Jonny's Senior(s) were vetted by another crawler that was vetted by themselves (i.e. find a web-of-trust path going from Jonny to the skeptical crawler)? Then the web of trust should be stored on yet another OrbitDB, or be redundant enough to compute on the fly by jumping from node to node. But long-range attacks could infiltrate good nodes that then introduce bad peers and say they trust them.
        – Alternatively, we trust Jonny by default, BUT we test our fellow crawlers regularly. If Jonny fails, we decrease our confidence in Jonny by 1 and in his senior by 0.5 (to be tuned; a small sketch follows this list).
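That last option is easy to express: propagate a failed spot-check up the trust chain with a decaying penalty. All the constants are placeholders to be tuned:

```python
def penalize(confidence, seniors, peer, penalty=1.0, decay=0.5):
    """Decrease confidence in a failing peer and, with a decaying penalty,
    in its senior, its senior's senior, and so on.

    `seniors` is assumed to map a peer ID to the peer that vouched for it
    (or None for bootstrappers); `confidence` maps peer IDs to local scores.
    """
    while peer is not None and penalty > 0.01:
        confidence[peer] = confidence.get(peer, 0.0) - penalty
        peer = seniors.get(peer)
        penalty *= decay
```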

The rules of the pool will determine what is a good contribution and what is not. I guess the pool will provide an implementation to run.

Organizing a vote is tricky because of Sybil nodes' voting power. Weighting by node "reputation" is tricky because reputation is local.


Okay, I will stop now, this is getting out of hand.

I have started to try to develop a crawler MVP.
I am not yet going to implement pools, because they increase the complexity.

But for this, I still need a way to observe the network for files being transferred.

The conditions:

  1. The crawler has to be given the CIDs before adding them to the index. It doesn't observe the network.
  2. It only adds two file types to the index: MIME Types text/plain and text/html.
  3. It publishes the results to IPNS. The result is a JSON dict mapping keywords to CIDs (a rough sketch follows).
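A rough sketch of what such an MVP could look like, shelling out to a local `ipfs` daemon; the text detection and tokenization below are crude stand-ins (real MIME sniffing of text/plain and text/html would replace guess_is_text), and the `ipfs name publish` step is indeed the slow part:

```python
import json
import re
import subprocess
import tempfile

def guess_is_text(data: bytes) -> bool:
    # Crude stand-in for real MIME detection of text/plain and text/html.
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def crawl(cids):
    """Build a reverse index (keyword -> CIDs) for a given list of CIDs,
    matching condition 1 (CIDs are handed in) and condition 2 (text only).
    Assumes a locally running `ipfs` daemon."""
    index = {}
    for cid in cids:
        data = subprocess.check_output(["ipfs", "cat", cid])
        if not guess_is_text(data):
            continue
        words = set(re.findall(r"[a-zA-Z]{3,}", data.decode("utf-8").lower()))
        for word in words:
            index.setdefault(word, []).append(cid)
    return index

def publish(index):
    """Condition 3: publish the JSON index under this node's IPNS name."""
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
        json.dump(index, f)
        path = f.name
    # `ipfs add -Q` prints only the final CID of the added file.
    cid = subprocess.check_output(["ipfs", "add", "-Q", path], text=True).strip()
    # Point this node's IPNS name at the new index (this call can be slow).
    subprocess.check_call(["ipfs", "name", "publish", f"/ipfs/{cid}"])
    return cid
```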

Great! Best of luck :muscle:!

Also: this:kissing_smiling_eyes: :notes:

There are a few already. There is cyber.page, which is a search engine / protocol that works with IPFS. And there are a few crawlers.

Here are the docs for cyber.

PS. I am affiliated

In short, the way cyber works is something like this or here. There is of course the WP.

@CSDUMMI would love to chat more =)

This is true.

An IPFS-based search engine already works, and everyone can take part in creating a knowledge graph. I would be very happy if the IPFS community took an active part in this.

I'm greeted with total darkness. And can it be accessed from IPFS as well?
If it can't, then it is no different than using DuckDuckGo with site:ipfs.io in the search line.

This is more like a cyberlink knowledge graph. You must upload your files or IPFS hashes with keywords to the knowledge graph. After that, you will be able to find them in the search engine. The knowledge graph is now in the process of being filled. It already has over 100k cyberlinks.

https://cyber.page/

My Problem is:
Is it really more than just a search query on DuckDuckGo with the site: attribute set to an IPFS Gateway?
Is it decentralized, interplanetary, independent, censorship resistant or modular?
Can you download the knowledge graph yourself and work on it independently? Can you create your own search algorithms?

Yes, yes, yes and yes. I sent you links to some docs above. Check it out =)

It is decentralized. It is interplanetary. It is independent. There is no censorship. It is modular. You can fork the client or the chain. You can build your own graph, etc.

You can create your own search algorithms. You can govern the whole system via on-chain governance, etc.

What do you mean by that? Cyber uses IPFS as a DB for storing content, and it uses IPFS CIDs to create cyberlinks. However, cyber can work with any stateful or stateless protocol, as long as you can have a pair of CIDs and prove their source.

I mean, suppose that cyber.page were blocked in my country. Could I still access it, for example through IPFS,
by downloading some software or accessing some file on IPFS?

cyber.page is just a PoC reference gateway, no more. You can access the protocol via any possible client. For example, there is a TG bot, @cyberdBot, and there are Firefox and Chrome alpha extensions. We have started to work on a browser, called cyb, which is actually a personal blockchain application on top of the protocol. So there is no way to block it, unless you shut down the network.

Anyone is free to fork the client and to build whatever gateway they want on top of it; they could even make it private or semi-private by filtering the front end.

It's still early days and we are not in mainnet yet. If you fancy, I'd be happy to chat and tell you how it works in more detail. Or you can check out the code on GH: https://github.com/cybercongress/go-cyber (that's the protocol repo).

The short answer is yes, you can access it.

I wonder if any blockchain-based system can be "interplanetary". Up to 24 minutes of latency between Earth and Mars easily kills it (at the very least, miners/validators cannot be distributed across planets). Also, one of the super fancy goals of IPFS is to be partitionable, that is:

Info and apps function equally well in local area networks and offline. The Web is a partitionable fabric, like the internet.

This is also a reasonable requirement for an interplanetary system, and it seems to exclude current blockchain technology, which assumes the existence of a global, unpartitioned internet.

I have to agree with you, @sinkuu. That's why it is not such a bad idea to have no global index, but rather small crawler communities
that eventually exchange their findings but don't stay in constant contact with everything else.

I think that my MVP Crawler kind of works.
It creates a reverse index mapping keywords to CIDs in JSON format.
I haven't yet gotten it to automatically publish to IPNS, because that seems to take a very long time.

Hi! I created Cyber and would love to address your concerns.

I wonder if any blockchain-based system can be "interplanetary". Up to 24 minutes of latency between Earth and Mars easily kills it (at the very least, miners/validators cannot be distributed across planets)

You are correct: it's likely we will not be able to sync one chain between Mars and Earth. But we shouldn't need to. I am pretty sure semantics on Mars will be very different from Earth semantics.

That is why we defined a mechanism which allows proving the rank of any given CID from the Earth chain to any knowledge graph run by Martians. So you will only need to sync the ranks of anchor CIDs back and forth using some relay.

I am pretty sure that solution 5 will not be able to work without solution 4.

  1. You can learn from Yaca that you can't build a useful search engine by following a complete bottom-up utopia. The reason for this is quite straightforward: relevance has to be somehow protected from trivial Sybil attacks. You cannot achieve this without some economics. And yep, you can't add an economic layer without DLT, due to double spends.

  2. Another problem with the bottom-up utopia is that, due to the inability to have a full index, such a search will never be able to answer questions better than a top-down solution.

  3. A top-down approach does not have to be complex and centralized around one blockchain to rule them all. Check the WP.

  4. I am pretty sure that it is a good idea to develop the bottom-up utopia on top of the top-down one, so you can get the best of both worlds.
