[deprecated] IPFS ❤️ human readable URLs - Use a DHT to upgrade URLs to IPFS-CIDs

Hey guys, hope you're all doing very well!

We have two great features, DNSLink and IPNS, which allow us to upgrade domains (and, to a degree, URLs) in the browser to IPFS-CIDs which the browser can then fetch from the IPFS network.

But there are limitations that currently hinder a smooth transition from a regular web server to a site running fully on IPFS. The main ones are query strings after the path part of the URL, as well as the missing support for any non-http(s) scheme.

The idea to fix this is simple:

  • Hash the URL
  • Write the CID information into a DHT under the URL hash
  • Sign the information with the IPNS key for the domain
  • Store the IPNS key for the domain in DNS

To avoid false information flooding the DHT, every record should be verified individually by the nodes storing it: they should ask DNS for the IPNS key and check the signature before storing the record and offering it to the network. The timestamps should also be checked against the host's clock, so that no entries valid only in the future or the past can be published.
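Here's a minimal TypeScript sketch of what a storing node might do; the record shape, the "ipnskey=" TXT record format, SHA-256 as the URL hash, and the verifySignature helper are all assumptions made for illustration, not part of the proposal:

// Minimal sketch: derive the DHT key from a URL and verify a record before
// storing it. Record shape, TXT format and hash choice are assumptions.
import { createHash } from "node:crypto";
import { resolveTxt } from "node:dns/promises";

interface UrlRecord {
  url: string;            // URL the record describes
  payload: string;        // the signed body (like the YAML-style examples below)
  signature: Uint8Array;  // signature over the payload
  validSince: Date;
  validUntil: Date;
}

// Key under which the record would be stored in the DHT.
function urlHash(url: string): string {
  return createHash("sha256").update(url).digest("hex");
}

// Look up the IPNS/libp2p public key the domain advertises in DNS.
// Where exactly this lives (record name, format) is an open question;
// "ipnskey=<key>" in a TXT record is just a placeholder here.
async function ipnsKeyForDomain(domain: string): Promise<string | undefined> {
  const records = await resolveTxt(domain);
  for (const parts of records) {
    const txt = parts.join("");
    if (txt.startsWith("ipnskey=")) return txt.slice("ipnskey=".length);
  }
  return undefined;
}

// Placeholder: a real node would use the libp2p key type's verify() here.
function verifySignature(pubkey: string, payload: string, sig: Uint8Array): boolean {
  throw new Error("not implemented in this sketch");
}

// The checks a DHT node would run before storing and offering a record.
async function shouldStore(record: UrlRecord): Promise<boolean> {
  const now = new Date();
  // Reject records dated in the future or already expired.
  if (record.validSince.getTime() > now.getTime()) return false;
  if (record.validUntil.getTime() < now.getTime()) return false;

  const domain = new URL(record.url).hostname;
  const pubkey = await ipnsKeyForDomain(domain);
  if (!pubkey) return false;

  return verifySignature(pubkey, record.payload, record.signature);
}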

Rationale

We currently assume that the web has only http(s) URLs without a query part, and we can't support the rest.

Using hashes to inform a client that the information is available in the IPFS network reduces the need for workarounds and lowers the complexity of a transition between web servers and IPFS libraries.

Since this approach isn't limited to the http/https schemes, we can extend it to other URIs in the future.

Additionally, we can use URIs to resolve to p2p services inside the IPFS network. In the future this would let clients route something like IRC, SIP, or SNMP traffic to a p2p service inside of IPFS instead of natively over the internet. This approach allows for interesting failover, mobility, and encryption possibilities while also extending the usability of IPFS beyond storing data.

Technical specification

I haven’t given the technical details that much thought yet, so sorry for all the rough edges here. I just want to outline how it might work, not how it should work!

Redirects

Redirects are often used in web servers to move clients from old URLs to new ones or to move certain links to other locations.

IPFS could support this feature natively, so that a web server doesn't have to be contacted to perform the redirect before the client can upgrade to an IPFS path.

An example of what the data stored in the DHT could look like:

<ipns-pubkey>
---
type: "redirect"
from:
  scheme: "http"
  authority:
    host: "example.com"
  path: "/old-link/"
  query: ""
  fragment: ""
to:
  scheme: "ipns"
  authority: "example.com"
  path: "/home/"
  query: ""
  fragment: ""
valid-since: "2020-12-12T00:00:00Z"
valid-until: "2021-01-15T00:00:00Z"
...
<signature>

Wildcard URLs

If the DHT doesn't contain a valid result for the full URL, the client might drop certain parts of the URL to find a matching entry - for example, the fragment part might not be necessary to fetch the data from IPFS. As a last resort, the client can ask the DHT for entries for just the scheme and authority parts of the URL.

This not only opens the opportunity to specify the same information for multiple URLs, but also to specify a 404 page if the URL isn't valid.
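A possible client-side fallback order could look like this (just a sketch; the exact order is an assumption, not part of the proposal):

// Sketch of a possible fallback order: try the full URL first, then drop the
// fragment, then the query, and finally ask only for scheme + authority.
function lookupCandidates(raw: string): string[] {
  const u = new URL(raw);
  const base = `${u.protocol}//${u.host}`;
  return [
    `${base}${u.pathname}${u.search}${u.hash}`, // full URL
    `${base}${u.pathname}${u.search}`,          // without the fragment
    `${base}${u.pathname}`,                     // without query and fragment
    base,                                       // scheme and authority only
  ];
}

// lookupCandidates("http://example.com/welcome-page/?moreinfo=false#intro")
// The client would hash each candidate in turn and query the DHT until it
// finds a valid, signed entry.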

An example entry for a redirect with URL wildcard:

<ipns-pubkey>
---
type: "redirect"
settings:
  from:
    wildcard-path: true
    wildcard-query: true
    wildcard-fragment: true
from:
  scheme: "http"
  authority:
    host: "example.com"
  path: "*"
  query: "*"
  fragment: "*"
to:
  scheme: "ipns"
  authority: "example.com"
  path: "/404.html"
  query: ""
  fragment: ""
valid-since: "2020-12-12T00:00:00Z"
valid-until: "2021-01-15T00:00:00Z"
...
<signature>

This entry would be published in the DHT under the hash of the scheme and authority, to avoid having to publish it under every possible URL hash :wink:

Here's an example of an entry that matches the source URL while ignoring the fragment part:

<ipns-pubkey>
---
type: "cid"
settings:
  from:
    wildcard-fragment: true
from:
  scheme: "http"
  authority:
    host: "example.com"
  path: "/welcome-page/"
  query: "moreinfo=false"
  fragment: "*"
content:
  id: "QmPZ9gcCEpqKTo6aq61g2nXGUhM4iCL3ewB6LDXZCtioEB"
  address-hint: [
    "/ip4/6.7.8.9/tcp/46147/p2p/QmZHrtsCdrkfTkq56Q96vCbN16rEkzWogN7P58w9ytgWAj",
    "/ip4/6.7.8.9/udp/47187/quic/p2p/QmZHrtsCdrkfTkq56Q96vCbN16rEkzWogN7P58w9ytgWAj",
  ]
valid-since: "2020-12-12T00:00:00Z"
valid-until: "2021-01-15T00:00:00Z"
...
<signature>

CID entries for URLs

As already seen above, the content of a URL can be linked to a content ID while optionally adding address hints to accelerate further network operations - if those nodes are online.
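For illustration, a client could use those address hints roughly like this (a sketch assuming the js ipfs-http-client API; depending on the library version, swarm.connect may want Multiaddr objects rather than strings):

// Rough sketch: dial the hinted peers first (best effort), then fetch the CID,
// so the first request doesn't have to wait for a separate provider lookup.
import { create } from "ipfs-http-client";

async function fetchWithHints(cid: string, addressHints: string[]): Promise<Uint8Array> {
  const ipfs = create({ url: "http://127.0.0.1:5001" });

  // Try to connect to the hinted peers; ignore the ones that are offline.
  await Promise.allSettled(addressHints.map((addr) => ipfs.swarm.connect(addr)));

  // Then fetch the content as usual.
  const chunks: Uint8Array[] = [];
  for await (const chunk of ipfs.cat(cid)) {
    chunks.push(chunk);
  }
  return Buffer.concat(chunks);
}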

The simplest entry for a file stored on an FTP-server would look like this:

<ipns-pubkey>
---
type: "cid"
from:
  scheme: "ftp"
  authority:
    host: "ftp.example.com"
  path: "/demo-file.txt"
  query: ""
  fragment: ""
content:
  id: "QmPZ9gcCEpqKTo6aq61g2nXGUhM4iCL3ewB6LDXZCtioEB"
valid-since: "2020-12-12T00:00:00Z"
valid-until: "2021-01-15T00:00:00Z"
...
<signature>

Note that specifying all parts of the URL is mandatory.

IPNS entries for URLs

Apart from permanently static files, a user might want to specify a dedicated IPNS key to publish new versions of a file under the same URL without having to update the URL DHT entries every time.

This type of entry allows just that:

<ipns-pubkey>
---
type: "ipns"
from:
  scheme: "ftp"
  authority:
    host: "ftp.example.com"
  path: "/demo-file.txt"
  query: ""
  fragment: ""
ipns:
  pubkey: "QmSrPmbaUKA3ZodhzPWZnpFgcPMFWF4QsxXbkWfEptTBJd"
valid-since: "2020-12-12T00:00:00Z"
valid-until: "2021-01-15T00:00:00Z"
...
<signature>

Storing the information in the DHT

I think it might be best to create CIDs from this data, with something like a folder/file structure, to keep updates space-efficient even when many elements are stored under one hash and clients have to update the data to fetch the next URL from the network.

This way the DHT could either be asked for the current CID, or for the CID and the data in one request if the client has no information yet. That reduces the number of round trips necessary to fetch the first byte of content, while updates with many items would remain very efficient since the data behind the CID could be fetched via the regular network - with the DHT nodes holding the data temporarily as if it were pinned.
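To make the two query modes concrete, something like this shape could work (hypothetical types and function names, purely for illustration):

// Hypothetical shape of the two query modes described above.
interface DhtResponse {
  currentCid: string;   // CID of the record tree stored under this URL hash
  record?: Uint8Array;  // record data, only inlined when explicitly requested
}

// Both functions are placeholders for whatever the DHT protocol would offer.
declare function queryCidOnly(urlHash: string): Promise<DhtResponse>;
declare function queryCidAndData(urlHash: string): Promise<DhtResponse>;

async function resolveUrlHash(urlHash: string, knownCid?: string): Promise<DhtResponse> {
  if (knownCid === undefined) {
    // First contact: fetch the CID and the data together to save a round trip.
    return queryCidAndData(urlHash);
  }
  // Otherwise just check whether the CID changed; if it did, fetch the updated
  // data via the regular network instead of through the DHT.
  return queryCidOnly(urlHash);
}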

If I understand your ideas correctly, you want to create a separate DHT for the web where:

  • opaque URIs (foo://bar/buz?query=val) are mapped to content-addressed paths (CIDs or path under IPNS key)
  • (and optionally) content-addressed paths are mapped to some URIs (as fallback if no providers)

And for this web-specific DHT:

  • have protocols for publishing/querying this DHT
  • sign records with libp2p-key
  • verify that libp2p key is present in DNS TXT record at the time of storing record (DHT node) and lookup (DHT client)

@RubenKelevra is that the correct description?

I feel I'm missing something: what is the value added on top of the existing DNSLink (required in both cases), apart from publishing a libp2p pubkey and signing DNSLink with it (which we can add without the need to invent a new DHT)?

@lidel that’s correct. The idea is to add more flexibility and smooth out the bumps for a better transition between web 2.0 and web 3.0.

  • Say you've got a page that uses links with a query part: you'll have a hard time switching to IPFS without running it under a dedicated subdomain, since links like “https://www.domain.tld/videos?id=332” would just break if you add a DNSLink to your domain.

  • With a URL database, you could either create a redirect to a URL without the query part, like “https://www.domain.tld/videos/id/332/”, or you could keep the same URL and just attach the content ID to it.

  • Another possibility is adding content IDs to URLs with schemes that are currently not supported, like an RTSP stream of a static file.

  • It also means you don't have to maintain a folder with all the data for a domain, since you can just put links to the individual CIDs in the DHT instead.

  • With long paths like “https://www.domain.tld/videos/building-6/camera-4/2020/04/10/time/22/10/” you also get a nice speed benefit because there’s no need to do request, parse, request, parse… for each level in the path of the URL.

  • It enables users to put metadata behind URLs for identity purposes, like GPG keys for email addresses: mailto://domain.tld/user@domain.tld/gpg → CID. Granted, this looks a bit clumsy, but nothing stops application developers from using gpg://user@domain.tld instead. That would work fine, since we only use the host part of the authority for checking authenticity.

  • Addressing books by their ISBN URN, for example, would also be possible - with a domain as the authority, of course: urn://archive.org/isbn:379200027X

  • In the future we might want to extend this to connect URLs to dynamic content through IPFS via the currently experimental libp2p stream mounting option.

I thought a bit about the current proposal and I think the DHT nodes should NOT pin the data behind the CID, even after it has been checked.

There's a chance of misuse: someone might use the URL redirect function to store data distributed over all nodes in the DHT, as there's basically no limit on how many items can be stored, and you could enumerate blocks of a file like http://domain.tld/1, http://domain.tld/2 etc.

This kind of misuse is possible in all kinds of DHTs, but in this case the amount of storage that could be abused is significantly higher.

Therefore I think we should just fetch the CID, parse it, validate it, and not pin it. This way the garbage collector can clean the space up when needed.

You can hold your CID as long as you like, but the DHT nodes might not.

Since a cleanup by the DHT nodes would mean you have to provide the data again yourself, you basically can't use this function to store any data reliably. :slight_smile:

Once again, this would only work if you are publishing URL2CID records for a domain that has a dnslink=/ipns/{key} record and sign those URL2CID records with the mentioned key.

I struggle to see the value added by this complexity:

  • I remain skeptical that you would get any meaningful performance boost from this, as the additional DHT lookup will most likely be more expensive than simply traversing the DAG to resolve the path to a file while you are already connected to a peer that has the root CID.
  • Redirecting query params on the HTTP gateway can be handled by simpler means, like a flat manifest file: #6214
  • Speeding up provider discovery can be implemented by publishing a dnsaddr TXT record and making go-ipfs preconnect to those multiaddrs as one of the discovery methods for DNSLink names

You could experiment with this URL2CID DHT idea in a separate project that acts as an HTTP proxy in front of go-ipfs and see if you can produce some benchmarks to prove me wrong.

I suspect we can do most of the things you mentioned via less complex means and existing DNSLink+dnsaddr :thinking:

I don’t think @RubenKelevra made any claims about performance improvements for his proposal.

I don’t think there was anything in the proposal about speeding up provider discovery either.

Hey guys,

just wanted to let you know that I want to deprecate this proposal. I'm currently working on a second version that focuses more on URIs than URLs.

@lidel I see the point about decreased performance, but on the other hand you haven't shown any way to convert a URL like https://www.domain.tld/videos?id=332 into a CID.

Is there one?

@zacharywhitley Well, you're right that @lidel is focusing pretty much on the performance aspect, but he has a point there. Performance would be lower for simple URLs - probably enough to cause issues for regular users.

But on the other hand, I like the flexibility of this approach, so I'll try to focus on URIs in the next proposal, which might be more interesting than URLs - while URLs are still included.

If I understood it correctly, the person doing this is trying to replace dynamically generated responses with static content on IPFS by crawling the preexisting dynamic site and putting the static output for each query on IPFS. And each update requires re-crawling.

In my mind this means they did not make their website independent from the backend: they are faking decentralization, because the source of truth is still the old app that generates output based on queries.

Personally, I'd rather not see people wasting time on partial solutions like this, and instead see them move to a model where the source of truth does not depend on some backend service.

Is there one?

Right now, if someone wants to put their website on IPFS and they want to keep the old URLs working, they need to make sure the static HTML+JS at /videos is capable of acting on the ?id= query or fragment parameter in the URL. This is trivial to do in JS by inspecting the window.location object and does not require the wasteful creation of id-based variants of the /videos file.
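For example (a minimal sketch; the /videos/<id>.json layout and renderVideo() are just assumptions for illustration):

// Static /videos page published on IPFS: read the legacy ?id= parameter from
// window.location and render the right content client-side, so the old URLs
// keep working without generating a separate file per query.
const params = new URLSearchParams(window.location.search);
const videoId = params.get("id"); // e.g. "332" for /videos?id=332

if (videoId !== null) {
  // The /videos/<id>.json layout is an assumption made for this sketch.
  fetch(`/videos/${videoId}.json`)
    .then((res) => res.json())
    .then((meta) => renderVideo(meta));
}

// renderVideo() stands in for whatever the site uses to display the content.
declare function renderVideo(meta: unknown): void;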

In addition to the JS route, I hope to have an alternative in the form of a manifest file where you can define redirects from legacy URLs to new paths, but we don't have that yet.


Ah right - that’s a pretty neat way to deal with that :slight_smile: