Any suggestions for making IPFS content searchable/addressable by a user-defined tag?

I’m thinking about implementing a web proxy (a forward cache for client-side browsers) using IPFS as a distributed web cache. The HTML pages are cached in IPFS for future access.

The problem is that a web cache is addressed by URL (or URL hash), not by the page’s content hash. Is there a way to tag the page content and make it addressable/searchable by URL?

Use a separate index if you want true search functionality. If exact queries are enough, you can use regular IPFS directories.

http://www.mysite.com/about/about-us.html -> /somedirectory/www.mysite.com/about/about-us.html
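A minimal sketch of that mapping in Python (the root directory name is just illustrative):

    from urllib.parse import urlparse

    def cache_path(url, root='/somedirectory'):
        # http://www.mysite.com/about/about-us.html
        #   -> /somedirectory/www.mysite.com/about/about-us.html
        parsed = urlparse(url)
        return '%s/%s%s' % (root, parsed.netloc, parsed.path)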

How will you handle malicious nodes inserting garbage?

You should also look at https://github.com/oduwsdl/ipwb. They’ve been tackling that issue comprehensively in order to build IPFS-based web archives.


How? All I can see is that they extract WARC archives to take advantage of IPFS deduplication; they’re still referencing them by hash.

@flyingzumwalt Awesome, ipwb is pretty much what I want. I’ve been looking into OrbitDB (https://github.com/orbitdb/orbit-db) too, to use IPFS as a KV store.

@es_00788224 @flyingzumwalt Thanks a lot!


You are right, ipwb didn’t make URIs searchable in IPFS. What I take from ipwb is:

  1. use WARC for packing web content
  2. the CDXJ file contains records with URL and timestamp, plus IPFS hashes for both the header and the content.

While skimming the code, I thought ipwb used IPFS directories for the CDXJ URLs, but apparently it doesn’t. ipwb seems to have very limited functionality.

I’m going back to OrbitDB.

Why do you need orbit-db? A directory is a directory; just treat it as one.

IPWB is just a way to put WARC files on IPFS.


IPWB is not just a way to put WARC files into IPFS, but for now that statement holds true to some extent: it covers the indexer part, though not the replay side of it. It is not difficult to build a transactional, on-demand, or client-side archiving system using IPWB. However, IPFS is not directly usable for the purpose described in this thread, though some techniques can be borrowed from IPWB to implement such a system. Here are a few assorted thoughts around it; some are utilized in IPWB and some are not.

  • Split the payload and headers (as done in IPWB), push them into IPFS, and create a CDXJ-style index file (very much like how it is done in IPWB); then, at replay time, serve only the most recent entry for that URL in the index (see the sketch after this list). Splitting headers from the payload allows greater deduplication, since headers will be unique each time while content is duplicated more often. Additionally, at replay time you need to decide how to replay the response headers: what to modify, add, or remove in them as a proxy.
  • Since older entries corresponding to each URI are not needed, one can eliminate the need to separately maintain a CDXJ index and utilize IPNS for the same purpose.
  • Whatever approach is used, don’t forget to canonicalize the URL, or better yet, use the SURT form of URIs (as in IPWB).
  • Some folks at Virginia Tech have done some transactional web archiving by putting in a middle layer that serves the live content, caches it, and serves the cached copy if the main service is down. They have recently extended that work to use a Redis-based in-memory cache and presented a poster, “Web Archiving Through In-Memory Page Cache”, at WADL 2017. Some ideas can be taken from that approach too.
  • Any good proxy/cache should take care of the Vary header, which lists the request headers that can be used for content negotiation. For example, the exact same URL can serve content in many different languages via the client’s Accept-Language header (if the server advertises that it allows content negotiation on that header). If a proxy/cache does not consider the Vary header, it will overwrite the cache with content in a different language whenever the URL is the same. As far as I know, the VA Tech research mentioned above does not consider the Vary header, as it is still only a research project and not a usable product. Additionally, in the case of IPWB, archival records are often looked up just by a URI and a datetime, so other possible content-negotiation dimensions are not part of the index.
  • To make your content searchable by URL while the content is deduped and stored in IPFS, you can utilize IPFS Search (the service is dead, but the code is available). We were thinking about a similar idea of full-text search in IPFS as well.
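Here is a minimal sketch of the split-and-index idea from the first bullet, assuming the legacy ipfsapi Python client (used elsewhere in this thread) and the surt package; it is only illustrative, not IPWB’s actual code:

    import json
    from datetime import datetime
    import ipfsapi          # legacy py-ipfs-api client
    from surt import surt   # canonicalize URIs into SURT form

    api = ipfsapi.connect('127.0.0.1', 5001)

    def index_response(url, header_bytes, payload_bytes, index_file):
        # Store headers and payload separately so identical payloads
        # dedupe even when headers (Date, Set-Cookie, ...) differ.
        header_hash = api.add_bytes(header_bytes)
        payload_hash = api.add_bytes(payload_bytes)
        # One CDXJ-style line: SURT key, 14-digit timestamp, JSON attributes.
        entry = '%s %s %s' % (
            surt(url),
            datetime.utcnow().strftime('%Y%m%d%H%M%S'),
            json.dumps({'header': header_hash, 'payload': payload_hash,
                        'original_uri': url}))
        index_file.write(entry + '\n')

At replay time, look up the URL’s SURT key in the index, take the most recent entry, fetch both hashes from IPFS, and rewrite the stored response headers as needed before serving.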

Hope this helps you come up with something that you can share with the community.


Add pointers to each CDXJ record at /ipns/YOUR_HASH/THE_URL.

Unless you’re planning to build something extremely large, a simple reverse word index (also distributed over IPFS), searched by some quick JS code with boolean operators, is enough for most purposes.
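For illustration, here is a minimal sketch of such a reverse word index, in Python rather than JS to match the code later in this thread (all names are illustrative):

    import re

    def build_index(pages):
        # pages: {ipfs_hash: page_text}; index: word -> set of page hashes.
        index = {}
        for page_hash, text in pages.items():
            for word in set(re.findall(r'\w+', text.lower())):
                index.setdefault(word, set()).add(page_hash)
        return index

    def query(index, all_of=(), any_of=()):
        # Boolean AND over all_of, then OR in any_of.
        hits = None
        for word in all_of:
            found = index.get(word, set())
            hits = found if hits is None else hits & found
        for word in any_of:
            hits = (hits or set()) | index.get(word, set())
        return hits or set()

The serialized index (e.g. as JSON) can itself be added to IPFS and fetched by clients at query time.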


We have thought about a few approaches to the full-text search implementation. Here are some assorted points in that regard:

  • The search indexing mechanism is out of scope; it can simply be a reverse index, a trie-based data structure, or something more sophisticated. Eventually we will be using some existing full-text search system (such as Elasticsearch), so there is no need to reinvent the wheel on that level. There are existing systems that take care of removing the boilerplate template data and extracting only the main content for indexing.
  • From a full-text search perspective, HTTP response headers are not that important; indexing only the payload should be good enough.
  • Rather than treating each CDXJ entry as a document to be indexed, we can use the payload hash as the ID of the document and index it independently (see the sketch after this list). This way, multiple documents with the same content will not be indexed as separate documents (hence deduplication). This significantly reduces the size of the index from an archival perspective, where the deduplication quotient is high.
  • An extra layer, a reverse index of Content Hash => Array of URI-Ms (where a URI-M is a pointer to a memento, i.e., an entry in the CDXJ), can be maintained independently for final presentation purposes; it requires some clever way to rank or bundle competing entries that share the same hash (of the same URI-R at different times, or of different URI-Rs).
  • Content type can be used to filter records that are meaningfully indexable. For example, text, HTML, PDF, and similar content can be sent to the full-text indexing system (such as an Elasticsearch or Solr server), while images/videos can be sent to an image classifier to flag potentially NSFW content or to a more sophisticated object-recognition system.
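A minimal sketch of that dedup-aware scheme, with plain dicts standing in for the real search backend (Elasticsearch, Solr, etc.):

    def index_record(search_docs, hash_to_urims, payload_hash, payload_text, uri_m):
        # The payload hash is the document ID, so identical payloads are
        # indexed only once regardless of how many times they were captured.
        if payload_hash not in search_docs:
            search_docs[payload_hash] = payload_text
        # Separately record every memento (URI-M) that points at this
        # payload, for the presentation layer to rank or bundle later.
        hash_to_urims.setdefault(payload_hash, []).append(uri_m)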

@mfan if you’re looking at orbit-db you should also look at @pgte’s work with CRDTs on IPFS. It does the same things as orbit-db but uses yjs to handle the CRDTs instead of relying on a homegrown implementation, which makes it easier to use.

@Kubuxu can you link to the code we used to do serverless querying in the wikipedia snapshots?

Here: https://github.com/magik6k/distributed-wiki-search

Blog post with more info about this at: https://ipfs.io/blog/29-js-ipfs-pubsub/

@flyingzumwalt @ibnesayeed @Kubuxu @es_00788224 @daviddias Thanks for all the information. They are very helpful to get me started with some experiments.

For the web proxying, I did a simple experiment with Proxy2 (https://github.com/inaz2/proxy2) on popular websites, e.g. cnn.com. The HTTP headers and page content are fetched and cached using their URLs, and get replayed at a later time in the browser. It seems relatively easy, with no blocking issues; what remains is:

  • some site-specific URL normalization/rewriting to de-dup and improve caching efficiency
  • ad filtering (ad traffic is huge)
  • better HTTPS support

Caching on IPFS

In my experiment, I used IPFS unixfs-dir and IPNS. It’s very slow.

    ...
    # Received an HTTP request; check for a cached copy of the page in IPFS.
    uid = hashlib.md5(req.path).hexdigest()
    ipfs_path = '/ipns/%s/%s' % (IPFS_BASE, uid)
    try:
        # Cache hit: return the cached response (headers + body) to the browser.
        page_dict = self.api.get_pyobj(ipfs_path)
        print 'URL %s found in ipfs: %s' % (req.path, ipfs_path)
        print 'UID %s found in ipfs: "%s"' % (uid, page_dict['body'][:60])
        print 'UID %s IPFS path: %s' % (uid, ipfs_path)
        self.wfile.write(page_dict['header'])
        self.end_headers()
        self.wfile.write(page_dict['body'])
        self.wfile.flush()
        return
    except Exception as ex:
        # Cache miss: fall through and fetch the page from the origin.
        print 'URL %s not found in ipfs: %s' % (req.path, str(ex))
        print 'UID %s not found in ipfs.' % (uid,)
        print 'UID %s IPFS path: %s' % (uid, ipfs_path)
        print 'Fetching the page... for %s' % (uid,)

    ...
    # Fetch the page content from the origin server.
    ...
    # Page downloaded; store it into IPFS and republish the cache directory.
    uid = hashlib.md5(req.path).hexdigest()
    with self.lock:
        # Resolve the current cache root directory behind the IPNS name.
        ipfs_res = self.api.name_resolve('/ipns/%s' % (IPFS_BASE,))
        root_hash = ipfs_res['Path']
        # Store the response headers and body together as one pickled object.
        page_hash = self.api.add_pyobj({
            'header': header_file.getvalue(),
            'body': page_file.getvalue()
        })
        # Link the new page into the directory and republish it under IPNS.
        ipfs_res = self.api.object_patch_add_link(root_hash, uid, page_hash)
        ipfs_hash = ipfs_res['Hash']
        self.api.name_publish(ipfs_hash, key=IPFS_BASE_KEY)

I’m afraid this design is not going to work in a production environment, due to the contention when many nodes publish the IPNS name (IPFS_BASE) with a shared key (IPFS_BASE_KEY) at the same time. The performance is also not acceptable; it’s extremely slow even with a single flow in my experiment. Any suggestions to fix the contention and arrive at a working design using IPFS directories?

For the project, I’m thinking about using IPFS to help people gain access to blocked internet content.

In a country with national-level Internet censorship, people usually use a VPN to work around the national firewall and access blocked content or websites. We could use the decentralized IPFS to circumvent censorship and make that content available to more people:

  • As long as one copy of a hot news story or blog post gets through the national firewall, there’s no need to go over the firewall again and again for it.
  • Most content-suppression technologies are centered on the national firewall. Fighting the national firewall is tough; e.g. VPN traffic can get identified and blocked within minutes, and some countries now completely ban private VPN usage by individuals.
  • For governments, it’s not easy to control a p2p network within the country. IPFS has many features that make it censorship-resistant: tolerance of node disconnections and intermittent availability, pluggable libp2p modules that can easily be customized to camouflage network traffic, etc.

The Wikipedia snapshot project is great and inspiring. However, much of the Wikipedia content in a snapshot could be long-tail content. I’m now thinking about using IPFS as a web proxy to provide the Google or Wikipedia search pages to users. We could fetch the up-to-date (or almost up-to-date) information those users are seeking, and hot/popular content would be guaranteed to be cached in the swarm.

Since IPFS directories are too slow and not safe for the web cache store, I also looked into IPFS key-value stores.

Orbit-db and Tevere are amazing. However, both of them maintain separate local storage outside the IPFS swarm, and both require IPFS pubsub to trigger log syncs among peers.

A quick skim of the go-floodsub code told me that it doesn’t use the IPFS store for its messaging and peer management. I guess it’s going to have a hard time scaling and probably won’t work well with nodes that are only intermittently available most of the time (which matters for censorship resistance).

It’s surprising to me that it’s not easy to build a key-value store upon IPFS. Has nobody ever suggested supporting “content tagging” or “tag-addressed” lookups, in addition to “content-addressed”, on IPFS?

I spent some time thinking about content tagging while reading about IPFS. I’m starting to feel the intense stare of “content-addressed” from the title of Benet’s IPFS paper (DRAFT 3) :slight_smile:

As I understand it, if we make content addressable/searchable by its tag: 1) many of the benefits provided by the system would be gone, e.g. tamper resistance and dedup; 2) the conflicts caused by arbitrarily named tags seem unmanageable, especially once the data has propagated through the network, and how would nodes decide which value is the latest one? 3) other issues, etc.

However, I have a strong feeling that tagging is very useful and could be a great feature for IPFS, even though there is no way to guarantee that tagged content stays consistent across the system.

Have there been discussions about similar features in the past? Thanks!

You should use an existing web scraper such as httrack instead of rolling your own.

How are you going to share this key?

No, there are a couple of big issues with this.

  1. Most national firewalls operate at the DNS level. You already have CacheBrowser for those who try to be a bit more sophisticated.
  2. IPFS isn’t anonymous. If they live under an actual repressive regime, all it takes is a simple ipfs dht findprovs QmOFFENDING_CONTENT to put someone in jail or worse.
  3. It only works for static pages.
  4. You need nodes you trust to do it.

No, it’s not. Tor can evade GFW/Golden Shield with ease. What countries are you talking about where a VPN ban is enforced?

It’s trivial. You can use traffic patterns, or just block the DHT seed nodes. You could also create some custom implementation of the IPFS DHT that connects to as many nodes as possible and silently adds all the IPs it can find to the IP block list.

That just makes it fault-tolerant.

As I said earlier, they can just block all the IPFS nodes. The ones that really manage to piss them off might get an <img src="1.2.3.4/veryveryverylongstring.png"> tag inserted somewhere, where 1.2.3.4 is their IP. This has been done in the past, and overwhelming a residential connection isn’t very hard.

Why? With Wikipedia, you have the entire database. Build a simple reverse word index instead; it’s good enough for most uses. Google doesn’t like people scraping their searches.

You can just use the IPFS directories for that. The key is the file name, the value is what’s referenced by the link in the DAG.
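For instance, here is a minimal sketch of a directory-backed key-value store using the MFS (ipfs files) API via the legacy ipfsapi client used earlier in this thread (the /kv path is illustrative):

    import io
    import ipfsapi

    api = ipfsapi.connect('127.0.0.1', 5001)
    api.files_mkdir('/kv', parents=True)

    def kv_put(key, value):
        # The file name is the key; the block it links to is the value.
        api.files_write('/kv/%s' % key, io.BytesIO(value),
                        create=True, truncate=True)

    def kv_get(key):
        return api.files_read('/kv/%s' % key)

Note this only covers a single node; coordinating updates from many nodes is the harder part discussed below.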

Because it’s a downright horrible idea. You can build a search engine on top, and there are decentralized implementations available in various forms. A simple reverse word index is good enough for most purposes and easy to decentralize.

No, the system will be gone. What you describe is susceptible to DoS attacks.

There was something called ipfs-search, but they shut down. Code available on GitHub.


I copied the IPFS_BASE_KEY onto another node :sweat: So far that’s the only way I’ve figured out to let many nodes update the same directory. You mentioned “You can just use the IPFS directories for that. The key is the file name, …” How exactly do I do that while allowing many nodes to update the same file/directory name?

Distributed web cache

In my experiment, the IPFS web cache works fine for dynamic websites. It cached all HTTP requests, including web service calls.

About Censorship

  • Googling “vpn ban in china” will show you recent developments on the VPN bans in China and Russia.
  • Tor and VPNs do not work well in China. I was in China for an extended stay in March; I had 3 VPN setups (IPsec, PPTP, OpenVPN) in different AWS regions and one paid VPN subscription (PIA). None of them worked out of the box: within minutes after a VPN session started, the traffic was cut off and dropped. You need to use shadowsocks or something similar to http://blog.zorinaq.com/my-experience-with-the-great-firewall-of-china/ to camouflage the VPN traffic.
  • Note that the Internet censorship is centered on the GFW. Within the country, VPNs work fine, and the censorship mainly depends on the tight control of websites and social platforms.

Searchable tag in IPFS

Now I realize that IPFS knows only the content hash and not the content value, and I see why “the system will be gone.” However, I don’t think searchable tags are a horrible idea in IPFS.

To support tags, we would need tag queries as a new message type, distinct from the normal content queries in IPFS, and some changes in routing. To support a “tag” as a new storage object, we would extend IPFSLink and IPFSObject and the corresponding protobufs. That part is relatively easy.

I guess we can live with a tag having many different values in the system; it’s up to users to find good use cases for tags. Hopefully people will find creative ways to use tags in their apps, e.g. tag prefixing or tag namespaces. Unused tags would eventually be removed from the system. To help publish tags in the network, we might need to augment the routing key to be a pair of keys (hash(tag), hash(content)), or even better (hash(tag), hash(hash(tag), hash(content))), as a way to track hot tag/value pairs and keep them in the system.
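A minimal sketch of those composite routing keys; this is purely illustrative and not part of IPFS today:

    import hashlib

    def tag_key(tag):
        # Records for a tag would be announced and looked up under hash(tag).
        return hashlib.sha256(tag.encode('utf-8')).digest()

    def tag_value_key(tag, content_hash):
        # content_hash: the (multi)hash bytes of the tagged content.
        # hash(hash(tag) + hash(content)) identifies one tag/value pair,
        # which the network could track to keep hot pairs alive.
        return hashlib.sha256(tag_key(tag) + content_hash).digest()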

I guess one important use case for searchable tags would be a native key-value store on IPFS.