This is a question, that has occurred to me today.
There are currently, to my knowledge, no trust-less, distributed ways of searching through the content published on IPFS.
Would there be an interest in a search engine, that followed the IPFS principles of trust-less and distributed?
I would say, that such an engine, should make it possible, to search through the content published on IPFS and give anyone the ability to implement their own search algorithms, determining, what kind of content they want prioritize.
The IPFS distributed search engine would have three parts to me:
Scrappers, that download files from the IPFS Network and create Index Files, which they then upload back onto the Network.
Index Files, a special file format, that is versioned and uploaded to the network. These Files contain entries (CIDs) and a set of keywords, that the scrappers classified as relevant to these CIDS as well as meta data.
Algorithms, that use these Index Files to, given a search term, find a sorted list of CIDs relevant for said term.
In the end, this would mean that if somebody wanted to search the IPFS network, they would download one of the Index Files, use an algorithm of their choosing to find the relevant search results (CIDs) and then download those, they want to look at in more detail.
Is there any interest in such an IPFS application?
The Index File
To implement the Index File, I want to add, that I conceptualize them as a DAG, where every File refers to one or more other Index Files, with other CIDs. Thus we can easily establish a trust algorithm, similar to Blockchains, where an Index File (singular) is trusted more, the more Index Files refer to it.
So far, I’ve setup a public gateway and figured out how to setup dnslink DNS entries as well as posting static content to a new key linked to a regular DNS subdomain.
I’ve experimented a little with pubsub…
My very limited experience so far leads me to the suggestion that IPFS search would be quite useful if there was a pubsub-like feature that allowed advertising of content to be added for search via some IPNS flagging system…
something like this:
ipfs name publish --key=mykey --search=true [hash]
Yes, I think there should be an agreed upon default format (like HTML for the W3), with some way of adding search keys (keywords in HTML), that is preferred by search algorithms. But this should (a) not be mandatory and (b) not exclude files, that are not in said format. e.g. in W3 Search Engines, there is preference for HTML, but PDF Files are not by default excluded. (HTML wouldn’t be such a bad choice here)
In the proposed design, I have aimed for two things, that at least ipfs-search doesn’t seem to achieve: Distributed and Modular.
Distributed means, that the Index Files are actually uploaded to IPFS and can be downloaded by anyone.
Modular means, that although there is a common Index File Format, both scrappers and searchers can be implemented however one likes. They interact through the common File Format, but how they choose the files to scrape or how they rank and analyze the files, should be mostly the matter of the scrapper / searcher given the data in the Index File(s).
I have tried to use both these engines, but have only succeeded with the first. I think, these are not entirely sufficient for IPFS, because as far as I can see, they don’t embrace these design principles of distributed interplanetary networks. I think, that an IPFS search engine, should also be Interplanetary, just as the Network.
There is also YaCy, a decentralized search engine (for legacy web) that is more aligned with what you are envisionning. I think it is a good starting point for maturing the reflexion on how to port it to IPFS.
My idea is truly very similar.
I would like to create the IPFS Serach Engine (IPSE?)
with these three components:
An Index File Format (INDEX)
A crawler, running on top of an IPFS node, indexing every File that it finds in it’s peers
A searcher, running on top of an IPFS node, fetching the most recent version of the INDEX and using it to search.
The INDEX File is really the heart of such a search engine. Because everything else is quite easy after that.
The INDEX File would be a DAG, where every new entry to INDEX would refer back to one or more previous INDEX Files. An INDEX File would consist of entries, from CID to data about the CID. The data can be determined later, but has to always be optional.
While loading the INDEX File, the most recent version of the same CID are used. If there are two contradictory versions of the same CID in the INDEX, which means two scrappers found the same file twice and found different data in it, than they will be both disregarded.
If there are two different tips of the INDEX available, the most common denominator is found and the number Files after that is determined, then the INDEX File with the longer chain of trust is used.
This is how I want to get rid of the trust in individual crawlers.
I know this idea isn’t perfect with or without proof of work, which is why there would really be a better distributed system to wish for.
If it is possible, than I would argue, that we should default to preferring HTML Files in INDEX Files.
This is because, when HTML Files are found, they are often the gateway into a Web Application and thus more useful than JS or CSS Files. And HTML provides all the Syntax needed to optimize a search engine for. (keywords, descriptions, authors, etc.)
The name is already taken unfortunately.
Apparently they also want a decentralized search engine, and put a blockchain on it for incentivization. Haven’t read their paper. Seems very complex coz blockchain.
Some general thoughts, although I’m not coding with IPFS myself:
Maybe the Index could be an OrbitDB (decentralized DB on top of IPFS) or something on top of it.
Indexes should merge with CRDTs? I think it’s baked in OrbitDB though, so you should be fine.
I think Indexes should be extensible, like for example JSON where you could add more fields. Newer versions could add metadata (“license-of-page: MIT”, “adult-content: false”, “find-on:www.foo.com/theRessource”, “date-find:20200525”,etc.) so that the newer algorithms can filter or sort the content, and the older can just ignore the fields they don’t know.
I don’t understand that part. A CID is immutable and will always map to the exact same string of bytes. If two scrappers find the same file with different contents, it’s actually two different files, with different CIDs. Maybe you are talking about IPNS records, that are mutable and can point to newer and newer CIDs? In that case worry not, IPNS records will soon be signed along with the date, so you can reliably keep the newer one. I don’t know how much more disk space it will take to keep the signature for later verification though…
What is a “tip”? I wouldn’t build the Index like a blockchain. A blockchain typically stores transactions where their validity depend on the validity of the previous ones. A blockchain needs you to be reasonably sure to have the latest version to add. This means nodes can’t add every millisecond or you ill never know the last version. This means they have to add in batches. This means you need to enforce who adds to the chains and when… etc.
You don’t need that, just use a CRDT !
What you will still need is that other crawlers don’t add junk to the Index. Obvious junk: “I found ‘cat’ in this bike-selling company website. Also, I own this company. Also, maybe I lied and the page doesn’t contain ‘cat’. Anyway, wanna check out the website?”.
YaCy solves this by “keep the crawlers crappy enough so that running them is more a hobby than a used feature for the web”. We can do better.
Solution 1: Centralize, be Google.
Solution 2: Make it run by a trusted foundation (like Wikimedia). Spoiler, they won’t because it’s expensive, and you still have to trust Wikimedia.
Solution 3: Make your trusted foundation award licenses for trusted crawlers. (Aka: your crawler actually did a good job based on our checks. Its results can be merged into the global database).
Solution 4: Top-down utopia: the Blockchain. Decentralize, and make crawlers pay to add on the Index so that it is costly to put junk online. Following the payments means a leger. Decentralized ledger means a consensus, meaning people running it, meaning incentivization + common and automated and not flexible way of deciding what is junk, meaning more work, meaning more incentivization. basically, build a cooperative world where nodes are forced to not trash the network. Not impossible, just very complex. See Whitepaper above, I guess.
Solution 5: Bottom-up utopia: Local reputation system. Let anyone add their junk if they want, you will just keep count on who added what, and who is trustworthy. You won’t be able to fetch all the index because it is too big and full of junk. So you will fetch it from neighbors and build a reputation for them, according to you. If people sign their contribution, you can ignore bad contributors and unsigned contributions, and trust more the “good” contributors (according to your local definition). The MVP is way simpler. You can then add peer exchange, a web of trust, multi scoring (this peer is fast, this one has more results), etc. Way simpler, way easier to customize, way more resilient. Problem: it’s more reliant on local node resources, and you’ll never have the global index.
I would go with 5. Personal taste. But that means no central index.
------ The philosophical metaphors 'coz I have too much time.
Solution 1: This is the army/a dictatorship. You don’t have choice. You don’t get to decide how it works. But it is so goddamn efficient.
Solution 2: This army is also a dictatorship, but a good one. I hope they win.
Solution 3: Oh, they hire mercenaries. I’ll enroll.
Solution 4: Let’s make a nation. Let’s create a big infrastructure. It’s complex, it’s fat, it’s expensive, but it will be eventually efficient, we can rely on it, and it eases communication with random peers because it’s a common standard and the infrastructure makes sure that the other peer will behave. Guys, let’s build better roads once and for all so that we don’t need complex cars for the majority of the trips. This is about performance and efficiency.
Solution 5: Let’s make small towns. Infrastructure is costly to maintain. I don’t want to adhere to a standard, I want to fine-tune my experience and build my own web of trusted individuals. I don’t need to interact often with random peers I don’t know. Besides, the infrastructure builders optimize for a use case, which is not mine. I prefer the flexibility, customization, and resilience. Guys, let’s build better cars so that we can always go where there is no road. At least I will do that. This is about flexibility and resilience.
Yes, I would think 5. goes in the right direction.
I don’t think that my idea with Blockchains was very good either.
But maybe, there is a way of doing this, that isn’t adding in a lot of junk to the INDEX and also doesn’t make the world too complicated.
We need a way of establishing consent.
We have two classes of peers in the Search Engine:
A Crawler is a peer, that adds to INDEX and verifies other who add to the INDEX.
The crawlers use Public Key Cryptography to sign off on each others updates and to the INDEX.
Before a crawler signs off on an INDEX File Change,
they verify the changes.
Every searcher keeps a list of peers they accept as crawlers.
This they can do, first of all, because they can decide to trust somebody or because they have a version of INDEX already
and thus the Identity of the crawlers, that signed off on those changes.
This they do, both on the INDEX File and by putting their currently most recent and supported version of INDEX on their IPNS.
And when a searcher wants to fetch the new version of INDEX, they just look at the IPNS records of these trusted crawlers and fetch the version, that the majority of those currently has put there.
The advantage of this is, although a group of crawlers could potentially split of the other crawlers, they will still mostly have an eventually consistent global INDEX.
Those splits would be very much intentional though.
To start a new INDEX File, you would create:
An INDEX File Format / Database.
Implement the crawler.
Implement the searcher.
But the question is, would there be a use for such a search engine?
The Problem is, I neither want a central authority nor do I want this do be a free for all.
Thus, before a crawler accepts a new version of the INDEX,
they’ll want to verify each change and before a searcher accepts a new version of the INDEX, they should see which version is most popular among the crawlers they trust.
Will they publish their verifications to the public? Send them to the peers they are connected too? Give them on a request basis? Or keep them locally?
I think they should send them on a request basis.
I think there won’t be a final INDEX File. I doubt we can have an eventually consistent INDEX file. One of the reasons being different crawlers will disagree on what is a valid update, and so will different searchers.
Every searcher will do its best to know a set of trusted crawlers, and the searcher’s INDEX file will basically be the union of the crawlers INDEX File they know and trust.
Depending on implementations, searchers could either query the crawlers for results, or ask them their Index files to build their own Index locally (faster queries, but more maintenance costs).
(I can also imagine a third type of nodes in the models: Brokers/Coordinators. They will fetch Indexes from crawlers to build a big Index so that Searchers can query them. To Crowlers, they look like Searchers building their Index. To Searchers not building local Index, they look like Crawlers with big search capacity. )
Do we? I think the model is BitTorrent here: find nodes to cooperate with, define “cooperative” for your own use-case/implementation (any node, a node seeding enough, a fast node, a node without data but useful on DHT lookups,etc.) and work with them.
I can see different implementations for Crawlers and Searchers, with different criteria, and it’s fine.
Let me introduce you several implementations serving several use-cases:
The general purpose Crawlers. It listens to DHT traffic for new files and tries to increase its index every day. Most searchers use it.
The skeptical lazy verificator. It listens to the queries it receives and double-checks the results of other crawlers. Searchers like to query it because this fact-checker is always up to date on hot topics and reliable (unless you are the first to query).
The Grammar Nazi Crawler. This guy crawls the same source as everyone else, but it is looking for something different. For example, it indexes well-written texts, and check for syntax errors so you can be sure to read quality post.
The video Indexer Crawler. It fetches some frames and looks for cat videos. Most Searchers don’t bother contacting it, but cat lovers do.
The Greedy Bastard. It built an annotated pattent database and found similarities between some. Pretty neat, huh? But you’ll need to pay a few coins to make a query :/.
The General News crawlers. It crawls hot topics and makes a selection of articles. They say it works closely with the Grammar Nazi Crawler and the Is This Well-Written Crawler…
The Times crawler. It doesn’t crawl much. But if you want a Time Magazine article, ask for it here.
The personal social media Crawler. It subscribed to your friends’ nodes and indexes their posts. You can catch up with your friend David’s latest avocado toast pictures in one query.
The Hosted-With-foo crawler. Foo.com lets you host your website on IPFS easily. They have a crawler only for the websites they help setting up, so that on their front page, they can say: “A guy did this incredible website with our tech!”.
The Crawlers Crawler. Aka the Broker (see above). It sucks in all other Indexes and you can query only it to go super fast if you want. A bit centralized, but hey, you decide.
General-purpose. It can query all the general-purpose indexers and most of the others.
Specialized. Searching scientific papers. It gives its queries a super complicated and confusing format. Luckily, it is compatible with the Scientific Paper indexer.
The Cat video searcher looks for Cat videos. It leaves in a dedicated Desktop App and queries only the Cat Video Indexer.
More generally, there is a specialized searcher for every specialized crawler.
The Overly Patriotic Searcher. Queries general-purpose indexers, but filters results from IP too far from itself. Its nemesis, the Globe-Trotter Searcher, does the opposite.
The Fake News filter. It lets you search for any hot topic and serves you a random cat video instead. It’s for your own sake.
And there’re many more…
Where were we?
If there is no central authority AND no consensus (blockchain enforced or protocol enforced), you won’t have it, I think. But it’s fine. We’re going down the Solution 5 road. There should be a standard way to communicate what type of crawler you are and what type of cooperation you are looking for, if any, but there won’t be a standard way to build trust. Effective ways of communicating what you do should help the network not contacting you if they don’t like how you work, though.
I suspect most searchers won’t want to have a huge Index locally. They will keep no Index locally except for cache, and verify the results, not the whole merged Index. Some will want to build a local Index without also being full-on crawlers, but they will either ask for some small specialized Indexes, or check the provided Indexes probabilistically, or just check the trusted crawlers’ signatures.
I think you touched and descriped many important aspects. I am convinced, that as the general model, yours is quite sufficient.
There are Crawlers, that Index IPFS or other crawlers how they see fit and publish their results to whoever wants to listen.
Then there are Searchers, that choose the kinds of Crawlers they want to trust and fetch their Index files
to perform searches.
But I think, that there should also be a low cost way of establishing pools of crawlers,
that work together on an index with either a certain guideline, rules, etc.
Like for example a crawler pool for The General News crawlers.
In a pool, there are common guidelines and maybe even programs for the crawlers to adhere to.
A pool has a single index, that I will call INDEX for now.
And if a crawler of the pool wants to make a proposal to add or remove something from the INDEX, these changes will be verified and voted upon by the members of the pool by their own rules. After that they will update the INDEX.
Through this, broker crawlers won’t have to keep a connection to every “General News Crawler”, but instead just fetch the current version of their INDEX.
Through this, small crawlers can still be utilized and new crawlers have an entry point into the system of trust, that the brokers or searchers establish.
Otherwise I think, that instead of proposing an actual protocol between crawlers and searchers, we should use IPNS and define what kinds of file should be put into IPNS, in order to be read by other searchers and crawlers.
For example: We could say that every crawler should have the current version of their index on their IPNS in a file /index or something like this.
But if there is such a need as you are saying that it is and as I can imagine there to be, then I am asking:
What kind of INDEX File Format should be used?
Should there even be a defined standard?
What kinds of search index should be supported?
There is clear benefit to providing such a standard:
The interaction between different crawler/searcher communities is not
But the problem is, that such a standard is really hard to achieve.
Looks like a good model. I would organize the pool as following (mostly summing up your design).
Run a few first crawlers for bootstrapping purpose.
They connect to each other and subscribe to a common topic: generalNewsCrawlers/v1.
They run a common OrbitDB in CRDT mode
They add “records” (let’s call it that way) to the DB (aka pool INDEX) along with their signature and metadata (maybe date, a provider (to shorter lookup for searchers), possibly the IPNS record or DNSLink, the size, a guess on the type of file, etc.)
To verify others work, they could take the records of others, reindex it and compare the result. If it is the same, they add their signature to the record in the index and increase the confidence they have in that fellow crawler. If not, they broadcast a message on PubSub: "Crawler X made a mistake on CID Y. It did Z when we expected W. See their signature of the faulty work: S. Here is my own signature for this alert broadcast. "
Upon receiving the alert message, the other crawlers check who is faulty (the bad indexer, or the whistle-blower?) and either relay the message or drop it and bring shame on the faulty whistle-blower instead. The faulty node is penalized, and a honest one is rewarded (increased confidence).
Below a certain confidence threshold, honest nodes drop the bad crawler from their local routing tables.
Searchers connect to one or several crawlers. They do not subscribe to generalNewsCrawlers/v1 which is only for crawlers (if they do, they won’t be penalized but will quickly be dropped as crawlers prefer keeping connections with honest crawlers, not cute but non-indexing searchers).
They query one or several crawlers. They got result back. They should have almost the same result from all crawlers (not exactly, as they are always indexing). These results have different signatures form different crawlers. Searchers compute the union of the results.
The searcher then filters and order as it sees fit. Possible criteria to combine: the order proposed by the crawler, number of signature (more trusted result), date it was seen first, different weights for the signature of different crawlers, etc.
Crawler joining the pool:
Jonny (joining crawler) connects to some Crawlers of the pool. He will enter a trial period and will seek the approval of a “senior” crawler who has successfully passed this trial and is in the pool.
Jonny asks a senior crawler for CID to index.
Jonny indexes them but doesn’t publish them to the Index. He sends them to the senior.
Senior crawler checks (some of) the results. If they are good, it sign them, it puts the record with both signature on the INDEX, and finally gossip about Good Jonny wanting to join in the PubSub channel. If Jonny tried to publish himself to the INDEX, or if results are not following the pool standard, they are rejected and Jonny is badly gossiped about and he has to start from scratch again after a backoff period. This backoff global as all seniors know when Jonny was bitten by his senior. If Jonny tries to make his senior check a too big file, he is punished too as he tried to DoS a senior. The Senior informed him of the maximum size, or it’s written in the pool’s rules.
Senior challenges Jonny with bigger and bigger files.
The more Jonny deliver, the more he is trusted by his peers. If he fails once, he starts from scratch.
After a lot of successful runs, he becomes a regular crawler and publishes himself to the INDEX. The senior send Jonny his diploma: a message saying that Jonny has now graduated, thanks to the Senior, + the diploma of the senior. This a chain of trust that can be traced to the original bootstrapers.
-Jonny can become a senior for a new Jonny, too.
Bootstrapping the pool:
Each peer wants to be a Jonny.
After some time, they see they didn’t find any senior crawler.
They say to N other Jonnys: "I didn’t find a senior crawler to send me tasks. Do you want to be my Senior, and I’ll be yours? "
The other Jonny either say “yes”, or “I found a Senior at this address. You should contact them, or try again with me in X time (I hope to have graduated by then).”
There is a possibility that several groups consolidate on their own with a risk of netsplit. To avoid that, seniors in each group can vet each other following the same process. They are Jonny’s in each other’s network. After successful vetting on both sides, their two INDEX and their two networks should eventually merge.
Depending on available resources, trust level among the pool, and pool size, crawlers may want to only check a fraction of the record.
We may want to check more for newer crawlers, with a low score, to evict bad ones faster, and not spend resources to check old, honest reliable nodes. However, this is a DoS vector (Sybil nodes spawning rapidly, joining and making honest node check their crap. Be evicted, rinse, repeat).
Bad crawlers have signed some records on the INDEX. How do we make the Searchers not trust these results?
– Revoke access to the INDEX and delete their record. How? And we lose information about bad peers.
– Make crawlers not send the records that were rejected.
– Make crawlers remember bad crawlers and not send the records that were sent by rejected crawlers (unless verified by someone else)
__ They can have a parallel OrbitDB which is a list of bad peers, along with the proof(s) of their faultiness.
How to prevent Jonny the Joiner to index useless data he generated and present that as results to the Crawlers? He will quickly earn the trust of the pool and be able to DoS it.
– Maybe Make the Senior send him some work to do first.
------ How do Crawlers know the message they just received saying a distant honest senior node trusted a distant Jonny with a result is legit? This unknown distant senior may be a Sybil.
---------- Should all Crawlers check Jonny’s results before trusting him? That increases DoS vector and doesn’t scale well if high churn.
---------- Alternatively should they check that Jonny’s Senior(s) was vetted by another crawler that was vetted by themselves (find a web of trust path going from Jonny to the skeptical crawler)? Then the web of trust should be stored on yet another OrbitDB, or be redundant enough to compute it on the fly by jumping from node to node. But long-range attacks could infiltrate good nodes that then introduces bad peers and say they trust them.
---------- Alternatively, we trust Jonny by default. BUT, we test our fellow crawlers regularly. If Jonny fails, we decrease our confidence in Jonny by 1, and his senior by 0.5 (to be tuned)
The rules of the pool will determine what is a good contribution and what is not. I guess the pool will provide an implementation to run.
Organizing a vote is tricky because of Sybil nodes voting power. Weighting by node “reputation” is tricky because the reputation is local.
Okay, I will stop now, this is getting out of hand.