Decentralized Search Engines for the Decentralized Web

We need decentralized tools for indexing the content on IPFS and searching through that content in a decentralized manner. One option (of many) is to modify YaCy to crawl IPFS content as well as classical web sites.

Related: the ipfs-search project is looking for a maintainer.

4 Likes

This may be a noob question, but how is a search engine able to label/name an object if it’s a single file represented only by an IPFS hash? It could be anything, from a docx to a zip to a PDF to a DMG, with no discernible filename.

It may be impossible to find the filename for some content just from the file data, but a lot of files on IPFS come contained in directories (e.g. if added recursively, or with ipfs add -w) that hold a named link to the file.
For example:

QmDIR... - Directory, containing:
  cat.jpg: QmAAA...
  secrets.txt: QmBBB...
  movie.mp4: QmCCC...

After indexing the QmDIR… directory (for example, after finding it in the DHT), the search engine will see the filenames that the hashes QmAAA…, QmBBB… and QmCCC… were given, and can save them in an index.
Now, when the file QmBBB… comes up in a search, the system can look at the filename index and see that the file was named secrets.txt.
I’m not sure if this is how ipfs-search works, but it’s one possible way to bind filenames to indexed hashes. Another is to search for the hash of the file you want the filename for: the directory, having a reference to that file, will come up in the search.
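A minimal sketch of that directory lookup with the go-ipfs CLI (the hashes are placeholders from the example above, and the exact column layout of ipfs ls may vary between versions):

# List the named links inside a directory object; each line is
# roughly "<child-hash> <size> <name>".
ipfs ls QmDIR...

# Append "<child-hash> <name>" pairs to a crude filename index.
ipfs ls QmDIR... | awk '{print $1, $3}' >> filename-index.txt

# Later, recover the name(s) a bare hash was seen under.
grep '^QmBBB' filename-index.txt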

1 Like

For these “anonymous IPFS objects” (added without the -w option) you might at least be able to get some information on the file type, e.g. with

ipfs cat Qm... | file -
/dev/stdin: Zip archive data, at least v2.0 to extract

But the search engine would have to ipfs cat (i.e. actually download) the object first. So maybe file type detection should eventually be built directly into ipfs. (Possible?)
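In the meantime, a crawler may not need the whole object: if the installed ipfs version supports the --length option on cat (recent go-ipfs does, but treat that as an assumption), a few hundred bytes are usually enough for magic-number detection.

# Fetch only the first 512 bytes instead of the whole object;
# most file formats can be identified from their magic numbers.
ipfs cat --length=512 Qm... | file -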

However, for some file formats that information seems to get lost in IPFS: if I cat and file-examine a local DMG, it tells me bzip2 compressed data, block size = 100k, but when I ipfs cat and file-examine the DMG directly from IPFS, it only says data.
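Since ipfs cat should return the added bytes unchanged, one way to narrow this down is to compare digests of both paths (the filename and hash are placeholders). If the digests match, file is seeing identical bytes and the differing output needs another explanation.

# Both digests should be identical if IPFS is returning
# the file byte-for-byte.
sha256sum local.dmg
ipfs cat Qm... | sha256sum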

2 Likes

There is a need to index content as well as filenames. Think of a service like Google Drive or Dropbox, where one can search one’s own PDFs or docs by name or content. The next question is what happens if a node tries to index encrypted storage (as permitted by the owner). Both indexing and encryption should be baked into IPFS.

I also want to add metadata to a hash, for example to identify it or tag it for easier searching.

If it’s the song Sunshower by RZA, I want to add it to a decentralized Wikipedia.
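One way to approximate this today, sketched with ipfs dag put (the CID, field names and values here are hypothetical, not an established tagging scheme): wrap the content hash in a small IPLD object that carries the tags, and share the wrapper’s CID.

# Wrap an existing content CID in a metadata object;
# {"/": "<cid>"} is the JSON syntax for an IPLD link.
echo '{"title": "Sunshower",
       "artist": "RZA",
       "tags": ["music", "hip-hop"],
       "content": {"/": "QmCCC..."}}' | ipfs dag put

# ipfs dag put prints the CID of the metadata object; the original
# file stays reachable through its "content" link, e.g.
#   ipfs dag get <returned-cid>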

I don’t understand how everything can be encrypted if any node can visit any IPFS site, look at the code and download it. Isn’t everything delivered encrypted between peers?

At that time (and probably still now) there is no encryption by default for files or content.
Anyone can always encrypt a file manually before adding it to IPFS.
Encrypted content doesn’t mean showing it publicly; I’m referring to the owner (or anyone permitted) being able to search through his/her content.
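A minimal sketch of that manual approach, using gpg for symmetric encryption (the filename and hash are placeholders; any encryption tool works the same way):

# Encrypt locally, then add only the ciphertext to IPFS.
gpg --symmetric --cipher-algo AES256 secrets.txt   # writes secrets.txt.gpg
ipfs add secrets.txt.gpg

# Anyone can fetch the ciphertext by hash, but only someone with
# the passphrase can read it.
ipfs cat QmXXX... | gpg --decrypt > secrets.txt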

I could see in ipfs-search that they sniff the DHT gossip and index file and directory hashes. It takes IPFS hashes and extracts content and metadata through Apache Tika. The indexing and searching is then done with Elasticsearch 5.
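That extraction step can be reproduced by hand against a locally running Tika server (the hash is a placeholder; 9998 is Tika server’s default port, but treat the whole setup as an assumption):

# Pull an object out of IPFS and ask Apache Tika for its
# metadata (/meta) or extracted plain text (/tika).
ipfs cat Qm... | curl -X PUT -T - http://localhost:9998/meta
ipfs cat Qm... | curl -X PUT -T - http://localhost:9998/tika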

I also found another tool by ConsenSys, IPFS-Store, which is an API on top of IPFS with search capabilities and much more.

I also found another search tool, built using searx: https://ipfs.io/ipfs/QmYo5ZWqNW4ib1Ck4zdm6EKteX3zZWw1j4CVfKtnAzNdvu/

1 Like

For me, the problem with searching the DHT, or the vast ocean of hashes people will be generating in the future, is the number of useless hashes that need to be resolved to discover content. We should focus more on searching IPNS.

I do it manually sometimes for peer IDs… when I should be sleeping… ipfs swarm peers… ipfs resolve… ls… get <hash that the peer ID has published>… not very efficient, though I’m sure someone could write a JavaScript app to make it easier. And this is just peer IDs, not to mention the files that are added but never published.
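Roughly, that manual loop as a shell sketch (assumes the go-ipfs CLI; peers that never published anything simply make ipfs name resolve fail and get skipped):

# For every connected peer, try to resolve its IPNS record and
# list whatever it has published. Slow and noisy, but it works.
for peer in $(ipfs swarm peers | awk -F/ '{print $NF}' | sort -u); do
  path=$(ipfs name resolve "/ipns/$peer" 2>/dev/null) || continue
  echo "== $peer -> $path"
  ipfs ls "$path"
done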

As for searching IPFS objects directly… this will also be inefficient. I add my website every time I change something, creating a new hash. Does anyone really want to search through 50 copies of nearly the same content, or through every version of a folder that keeps getting new content added?

My suggestion:
Add a distributed database to an IPFS client like Siderus Orion or IPFS Manager. Alternatively, files could be added/indexed/verified with a service like Stamp.io… but I prefer native interactions.
When adding files, users could add tags, i.e. title/description/username, and the client adds the tags file size/type/date added (very important).
The local DB is updated with the new hash, which could be broadcast to and synchronized with connected peers (see the pubsub sketch below). In this way, when a user searches the DB they will be directed to the latest version of the content, with the choice to also view older versions (if the file still exists somewhere).
Websites or companies with large data-storage capabilities and a lot of content, such as Google, Instagram, Wiki or LinkedIn, would be huge contributors to the trusted hash list. Having synced with large “trusted” entities, users would get the latest search results.
Other options, such as rating content as “safe, explicit, offensive, harmful”, could help remove negative content from the web.
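The broadcast step could be prototyped on IPFS’s experimental pubsub today (the topic name and JSON fields are made up for illustration; pubsub must be enabled, e.g. with ipfs daemon --enable-pubsub-experiment):

# Terminal 1: listen for tag records announced by peers.
ipfs pubsub sub file-tags

# Terminal 2: announce a newly added, tagged file.
ipfs pubsub pub file-tags \
  '{"cid":"QmAAA...","title":"cat photo","type":"image/jpeg","size":12345,"added":"2019-06-01"}'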

1 Like

This relates to publicly shared info.
As for private info, I think that should be handled before adding to IPFS. For simple things, just create a password-protected .rar, then add it to IPFS. Next, share the link and password with the recipient. It’s not military grade, but it doesn’t require any special software.

I’m hoping this function will also be implemented into clients.

Check out cyber. It’s a decentralized search protocol for web3 that utilizes IPFS.

The short story is that it’s a decentralized knowledge graph with CIDs, created and taught by users. Blockchain helps with spam protection and incentives.

1 Like

YaCy (“yet another Cyberspace”) is great, but the maintainer says it is being deprecated in favour of YaCy Grid, which is under development.

I think IPFS could benefit from a way of associating some sort of meaningful tag/descriptor with a hash. That would help an IPFS search engine too.

Hi, fuckGoogle!

Could you explain a little bit about what to do at Cyber if you try a search there and there are no results?

1 Like

Cyber is built around the cyberlink concept. A cyberlink is a directed link between 2 IPFS hashes.

It doesn’t crawl and add to an index by itself (at least not now; that may come later). Instead, it is a user-generated knowledge graph. Anybody can submit a transaction with a cyberlink to the blockchain. What the blockchain does is compute ranks for all submitted CIDs and answer queries with sorted results. So if you ask for some hash, it will always return the cyberlinked CIDs sorted by rank. If you ask for something that does not exist in the knowledge graph, the answer for 0 is returned (the hashes cyberlinked with the CID computed from ‘0’).
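Not cyber’s actual ranking algorithm, but as a toy stand-in: if cyberlinks were stored one per line as "<from-CID> <to-CID>" pairs, answering a query by link count would look like this (links.txt and the query CID are hypothetical):

# Return the CIDs the query CID was cyberlinked to,
# most-often-linked first (a crude substitute for the real rank).
awk -v q="QmQUERY..." '$1 == q {print $2}' links.txt \
  | sort | uniq -c | sort -rn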

The system is young and has not even reached mainnet, so the existing knowledge graph is not ready for prime time. But you can join forces in the formation of the best knowledge graph for humanity.

2 Likes

Hey man!

The search on top of Cyber’s protocol is an app, just like any app on the ETH protocol.

I suggest you start off with this 45-second video.

And then I would say go to the FAQ… but we are actually in the midst of reshaping how we explain Cyber.

The short-ish version is this:

Cyber is shaped as a superintelligent organism in the form of a consensus computer for answers. Cyber is managed by its users and token holders, without any jurisdictions or CEOs. It is structured as an artificial intelligence, in that it can provide answers to questions from the information taught to it by its users.

Cyber is an innovative search protocol that provides provable answers without an intermediary opinion. With the help of cyberlinks, blockchain, IPFS, Tendermint, cryptography and other distributed technology, it can index and rank data, which removes black-box intermediary opinions from the formation of the semantic core of the internet. This helps to decentralize the infrastructure of the coming Great Web.

Cyber allows the design of a trustless, provable and incentivized method of communication between those who provide content and those searching for it, effectively removing censorship and creating a world of opportunities for fields such as open AI, digital marketing, social networks, search mechanisms, decentralized oracles and much more.

Cyber is currently working on autonomous and programmable smart contracts, which will take it another step closer to becoming a superintelligence. Cyber has 2 DAOs which help to manage it: a community pool and programmable chain on top of the Tendermint consensus, where users can manage the chain and vote on its parameters, its inflation, etc.; and an Aragon, ETH-based DAO, which acts as a decentralized venture fund and bank for Cyber, hence separating the money (ETH) managed by the users from the state (the chain).

Currently, there are two open-source and functioning apps on top of Cyber: a decentralized Google analogue and a decentralized Twitter analogue.

Cyber is still in testnets. It’s 100 percent open source, of course. The code of the protocol can be found here. Other repos (search, Twitter, etc.) are also open and available in the same place.

PS. These are old guides, but they still give an overview of how the app works.

2 Likes