Thoughts on using IPFS for serving static websites

I recently wrote an article about the good and bad points of serving static websites on IPFS and thought this community would be an interesting place to discuss it. I won’t copy the contents here, but you can read it here:

(If people are interested I’m also considering writing a tutorial for creating production-ready static sites on IPFS)

5 Likes

Good article. Thanks for sharing it here. I wonder what the guys in here will say about the issue of getting the MIME type of data without having to put a file in a directory. I don’t know of any solution, but hopefully someone can clarify if this is still true. Having to put every file in a directory to preserve its MIME type just doesn’t seem ideal, but maybe there’s much less overhead than it seems.

2 Likes

I would love to have a MIME type included. Even relying on the file extension is hinky. Sure, it is unlikely that a gateway won’t map .js to application/javascript in the near future, but it does cause an issue with less common or new file types (for example, it took a while for most web servers to learn a MIME type mapping for WebAssembly files).

In the general case if we rely on the gateways to add critical metadata to data in IPFS we are creating a compatibility issue, both in the near and long term.

I’m not too worried about the performance, but if we could inline the mime type into the data blocks it would be very cool.

1 Like

Well, “file systems” themselves only have the file extension as the way to determine types (no MIME types/names), so it’s not that bad, but I know what you mean.

As long as there’s minimal overhead with everything being a directory, maybe the upside is everything’s always “appendable”. That is: you can always publish an update to a folder (using IPNS) with new files added, which would be impossible if everything was “files” instead of “folders”.

It isn’t that bad. But generally for local filesystems the application knows what type it is, or the user can select to override the choice detected by file extension+content sniffing. There are also more modern operating systems that do store the mime type explicitly and I would argue that this is a better design :slightly_smiling_face:

And when talking about HTTP gateways browsers do really care about the Content-Type header, and they don’t care at all about the extension (except for legacy content sniffing). So in this context it is actually somewhat bad.

Now you can argue that web browsers generally know what type of content they are requesting anyways. But that road leads to legacy and security hacks that aren’t going to change any time soon. Plus there are nice benefits, such as being able to request and display an image without needing to know what type it is upfront.
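
As a rough illustration (the CID and gateway URL are placeholders, not real content), from the page’s point of view all that matters is the header the gateway chooses to send:

// Hypothetical CID and gateway URL, purely for illustration.
const url = "https://ipfs.io/ipfs/<cid>/app.wasm";

async function checkContentType(): Promise<void> {
  // The gateway derives the Content-Type from the filename when building the response.
  const response = await fetch(url, { method: "HEAD" });

  // This header is all the browser acts on; if the gateway's extension table lacks
  // an entry (as with WebAssembly early on), streaming compilation and similar
  // features simply refuse to work.
  console.log(response.headers.get("content-type"));
}

checkContentType().catch(console.error);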

I’m not sure I see the benefit of being able to add more files. IPFS is largely built on immutable content-addressed data. You are right that with IPNS you can push “updates”, but that would be more of an argument for not allowing IPNS to sign directories, only files. I think that if you are signing something you plan to update, sticking to a directory so that you can add more things makes sense. However, for a JavaScript file it doesn’t really make sense to “add more”. Sure, an updated file can import other files, but the browser still only cares about the first one.

Thanks Kevin. That was a great read! Can you elaborate on what you mean by the following?

… If the IPFS native protocol takes off this could also become a global benefit without relying on third-parties due to IPFS’s peer-to-peer sharing.

What I meant by that is that currently the “free CDN” feature is based on the existence of public gateways. This isn’t really a decentralized approach and relies on the goodwill of those offering the gateways.

However, if a large majority of viewers were using the IPFS protocol directly, they would also help serve the site (in IPFS’s default configuration) to other IPFS viewers, so this feature would become available without public gateways.

So this is a problem rooted in adoption that should resolve itself in time if IPFS becomes more popular.

1 Like

Ah, you mean the bitswap part, right? As you download a resource you serve chunks to other nodes?

That is correct.

That is just lame and abusive of well-intentioned public gateways, imho.
People should be incentivized to install IPFS Desktop (or just IPFS and IPFS Companion). And if they persist in abusing the gateway, throttle their speed to 1 byte per second or so.

Now I’m no IPFS author or dev, but the “free CDN” in your article stings me a little.
IPFS is not a free dump-whatever-you-want CDN. I think the requests arriving through the gateway aren’t even super long-lived either (if someone could confirm this?).

The pesky thing is that people read your article, write their own articles (or YouTube videos) about it, and that spreads a mindset of IPFS being a free, unlimited, dump-all-you-can CDN; then they blame IPFS and are disappointed when it doesn’t work like that.

1 Like

If IPFS gets native browser support, the vast majority of people will have no clue, nor care, that they are using IPFS, any more than most people care about Ethernet, IP, HTTP, etc. The only people who will need to interact with it directly will be web developers, system administrators, programmers, and similar types of people.

Until we get native browser support, most people not in the IPFS space will only ever interact with the gateways. There needs to be a balance between using the gateways to promote IPFS and prove it works, and abusing the gateways and having them drop out or block entire categories of content.

1 Like

Welcome to the IPFS community, kevincox, and thank you for your link and for considering creating a tutorial on how to create a static website.

If you do go ahead with that tutorial, I hope that you do it in such a way that you can publish one hash for the website, for people to find you, which remains constant, regardless of edits you subsequently make to the pages there.

By default, Brave Browser already offers to install IPFS companion.

If I understand the question correctly, this can be done for any IPFS website by publishing the hash via IPNS (which gives you a stable hash for updating content).
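
As a rough sketch of that flow (using the ipfs-http-client JavaScript package against a local node; treat the details as an illustration rather than a recipe):

import { create } from "ipfs-http-client";

// Assumes a local IPFS node exposing its HTTP API on the default port.
const ipfs = create({ url: "http://127.0.0.1:5001" });

async function publishUpdate(html: string): Promise<void> {
  // Every edit produces a brand new immutable CID...
  const { cid } = await ipfs.add(html);

  // ...but publishing that CID under your IPNS key keeps one stable name
  // that visitors can keep using across updates. (For a whole site you would
  // add the directory instead of a single file.)
  const { name, value } = await ipfs.name.publish(`/ipfs/${cid.toString()}`);
  console.log(`/ipns/${name} now points at ${value}`);
}

publishUpdate("<html><body>version 1</body></html>").catch(console.error);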

While this is definitely an improvement and something to be glad about, this is not what I meant by “native browser support”. Brave, Firefox, and the other browsers have native support for HTTP(S), meaning that you don’t have to install an extension for it to work; support is built into the browser and is tightly integrated into things like the browser cache.

No, filesystems themselves don’t only have the file extension. Classic MacOS had ‘OSType’, OSX now has UTIs, and most modern filesystems support some form of “extended file attributes” (xattr).

Metadata for IPFS files has many use cases and would be beneficial for browsers, search, filters, ownership/authorship, and more.

Perhaps it could be implemented via a paired metadata file optionally stored with each file? That would not affect any of the core IPFS; it would just add a layer on top of what exists.

2 Likes

File systems have a “convention” for how to designate file types, using the extension part of the text of a filename. There’s nothing to stop people from using that in MFS because afaik files in MFS can have names, but I agree with your larger point that it would be nice if there were an officially designated MIME type in some metadata somewhere, and I’ve wished for this for a long time. I think the goal with MFS was to be compatible with Linux file systems, so introducing new metadata wasn’t on the agenda. I may be wrong.

Nope, they don’t. That’s the job of additional wizardry: request the stat data for a file and from that data determine the type, extension, … Even extensions aren’t that simple. Just think about detecting “.tar.gz” as a compound extension while detecting “my.random.file.txt” as “.txt”, not “.file.txt”. This is where you get into the world of “MIME databases”, where one can search that database by filename (extension included), which returns a MIME type, which then finally allows you to get the extension the file has. This is where classes like this come into play.

Filesystems themselves have no knowledge of a file’s extension or its type of content.
See this for reference. Now that is a generalized representation (statx); there might be filesystems (NTFS, Btrfs, APFS) that have more attributes available, but then you’re into filesystem-specific magic.

Extended attributes are a different beast altogether. I’m assuming Apple does something similar with APFS. You can only really use them if most tooling takes them into account. That has slowly been getting better, to the point of being quite usable now in the Linux world.

I think this all again, for IPFS purposes, boils down to the conclusion that there is a need to have a metadata layer on top of IPFS. Or rather, tooling on top of IPFS that allows one to add metadata in the main IPFS DHT. And that alone is complicated. For example, how do you say that File X has metadata and what’s the CID to get that metadata?

One way I can think of is creating a wrapper library on top of IPFS.
In that library you’d change the way files get added. It would add the file, yes, but it would also add metadata in, for example, a format like this:

{
  "fileCid": "<cid>,
  "metadataCid": "<cid to metadata>"
}

(yes, that blob can also contain all the metadata, doesn’t have to be a separate <cid>)

That small blob of JSON gets added and it in turn gets a <cid> too. That <cid> is the one the user gets as the file. That one is what’s being sent back and forth. Now if that tooling gets a download request for this “metadata CID”, it should itself read <cid>.fileCid to get the actual file <cid> and download the file.

This is just a braindump-style idea. It’s likely going to be far more complicated than this. Something like this would be required to implement IPFS as a “filesystem”. Sure, you have the current MFS and whatnot, but those only give you the file size and name (unless I’m super wrong?). You’d need more to show it nicely in a file explorer and allow for operations like sorting by date/type. You’d need a “stat call for IPFS”, which would be this metadata blob.
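
To make that braindump a bit more concrete, here is a hedged TypeScript sketch of such a wrapper (again assuming ipfs-http-client and a local node; the metadata shape and helper names are made up for illustration):

import { create } from "ipfs-http-client";

const ipfs = create({ url: "http://127.0.0.1:5001" });

// Hypothetical metadata shape; real tooling would need to standardise this.
interface FileMeta {
  fileCid: string;
  metadata: { name: string; size: number; mime: string };
}

// Add the file, then add a small JSON blob pointing at it.
// The CID of that blob is what users would pass around.
async function addWithMetadata(content: Uint8Array, name: string, mime: string): Promise<string> {
  const file = await ipfs.add(content);
  const meta: FileMeta = {
    fileCid: file.cid.toString(),
    metadata: { name, size: content.length, mime },
  };
  const blob = await ipfs.add(JSON.stringify(meta));
  return blob.cid.toString(); // the "metadata CID"
}

// Resolve a metadata CID back to the metadata plus the actual file bytes.
async function catViaMetadata(metaCid: string): Promise<{ meta: FileMeta; bytes: Uint8Array }> {
  const metaChunks: Uint8Array[] = [];
  for await (const chunk of ipfs.cat(metaCid)) metaChunks.push(chunk);
  const meta: FileMeta = JSON.parse(new TextDecoder().decode(concat(metaChunks)));

  const fileChunks: Uint8Array[] = [];
  for await (const chunk of ipfs.cat(meta.fileCid)) fileChunks.push(chunk);
  return { meta, bytes: concat(fileChunks) };
}

// Small helper to join chunks into one buffer.
function concat(chunks: Uint8Array[]): Uint8Array {
  const out = new Uint8Array(chunks.reduce((n, c) => n + c.length, 0));
  let offset = 0;
  for (const c of chunks) {
    out.set(c, offset);
    offset += c.length;
  }
  return out;
}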

“By Convention” literally means that the File System itself is just seeing a filename as a string. Humans and/or Software use the extension naming convention to assume a file type (mime). I seriously doubt anyone misunderstood me. lol.

This is assuming that we want more metadata (which I think we probably do). I think we should consider how it grows over time and how to support efficient streaming of large directories.

Indexing the files is interesting. Right now there is a de-facto index on name and nothing else. (There is metadata for size, but it isn’t sorted.) However, for different use cases different indexes are optimal.

  • A user may wish to sort by name, size, last modified or any other attribute.
  • A file manager may want to display name and size.
  • A file open dialog may wish to filter by file type.

At the end of the day we have a table of metadata and we should consider the possible layouts. I think we want to have indexes for sure; otherwise you can’t do streaming with a specific sort order (for example, show me the newest files, or the biggest files).

There are two main approaches for any sort of table:

Row-oriented layout

This is the most obvious solution. The main “directory” CID points at an object with links to the metadata entries as well as “indexes”.

entries:
  - cid: QmQ8Qc8Y85YGrQ1zCQdn8DGZj1BP4bsupQsXBw6xh2D4PL
    name: 2020
    size: 308681
    mime: inode/directory
  - cid: QmUj45TFLeCjNzF9kWXv88LVzGvAJTqa9HYHTFEQMhLCKY
    name: 2021
    size: 11827
    mime: inode/directory
  - cid: QmXvBPGMU1Dk8EwBPL5Q629AuD1fVdRF5vNCpqrTahRTVQ
    name: index.html
    size: 1783
    mime: text/html
indexes: # One index can be omitted by defining the sort order of `entries`
  cid: 
    QmQ8Qc8Y85YGrQ1zCQdn8DGZj1BP4bsupQsXBw6xh2D4PL: 0
    QmUj45TFLeCjNzF9kWXv88LVzGvAJTqa9HYHTFEQMhLCKY: 1
    QmXvBPGMU1Dk8EwBPL5Q629AuD1fVdRF5vNCpqrTahRTVQ: 2
  name:
    2020: 0
    2021: 1
    index.html: 2
  size:
    1783: 2
    11827: 1
    308681: 0
  mime: # values can repeat, so use [value, entry] pairs instead of a map
  - [inode/directory, 0]
  - [inode/directory, 1]
  - [text/html, 2]

One basic optimization would be storing entries as a map from name to the remaining metadata, much like sharded directories are today. This makes lookups by name fast. However, you would also want to be able to look entries up by position; otherwise the indexes would need to contain the filenames themselves, which could be expensive.
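
For illustration, that optimization applied to the example above might look something like this (a sketch only; the index positions would then refer to the name-sorted order of the entry map):

entries: # keyed by name, like sharded directories today
  2020:
    cid: QmQ8Qc8Y85YGrQ1zCQdn8DGZj1BP4bsupQsXBw6xh2D4PL
    size: 308681
    mime: inode/directory
  2021:
    cid: QmUj45TFLeCjNzF9kWXv88LVzGvAJTqa9HYHTFEQMhLCKY
    size: 11827
    mime: inode/directory
  index.html:
    cid: QmXvBPGMU1Dk8EwBPL5Q629AuD1fVdRF5vNCpqrTahRTVQ
    size: 1783
    mime: text/html
indexes:
  # as before, with positions referring to the name-sorted order of `entries`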

Column-oriented Layout

It would also be interesting to consider a “column-oriented” table of metadata. This is similar to the row-oriented layout, but instead of storing each entry contiguously you split each column into a different “datastream”:

columns:
  cid:
  - QmQ8Qc8Y85YGrQ1zCQdn8DGZj1BP4bsupQsXBw6xh2D4PL
  - QmUj45TFLeCjNzF9kWXv88LVzGvAJTqa9HYHTFEQMhLCKY
  - QmXvBPGMU1Dk8EwBPL5Q629AuD1fVdRF5vNCpqrTahRTVQ
  name:
  - 2020
  - 2021
  - index.html
  size:
  - 308681
  - 11827
  - 1783
  mime:
  - inode/directory
  - inode/directory
  - text/html
indexes:
  # Same as row-oriented

This solution is fairly naive. It probably makes sense to group cid and name together as lookup is probably the most frequent operation.

Column-oriented also opens up a lot of compression options, as many of the values will be the same or similar (for example a directory with everything modified at the same time, or with the same few MIME types or HTTP headers).
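
For instance, a column with repeated values could be stored run-length encoded. A purely illustrative sketch:

columns:
  mime: # [value, count] pairs instead of one value per entry
  - [inode/directory, 2]
  - [text/html, 1]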

Comparison

For this comparison I’ll assume that both are sorted/sharded by filename and that the column-oriented layout interleaves the cid and name columns.

cat

These are fairly similar; however, column-oriented is slightly better because it pulls in less irrelevant information (for example, the file types aren’t pulled in). This means that if you are doing repeated cats your metadata will be hotter and you will get better efficiency.

stat (full)

This is a clear win for row-oriented. It pulls in basically the minimum number of blocks and requires no “joining”. Column-oriented pulls in a block per metadata column requested, leading to more data transfer and higher tail latency.

stat (partial)

This is unclear. It depends on how many metadata columns there are and how many you need. For repeated stats column-oriented will come out on top because you are pulling in only the relevant types of data. However, for cold lookups row-oriented will likely access fewer blocks for any reasonable amount of metadata.

Furthermore, most software assuming POSIX semantics will do a “full” stat, which pulls in a lot of fields.

list (full)

This is a win for column-oriented, as it only pulls in the metadata it needs; row-oriented will access more blocks.

list (partial)

When listing a small part of a directory, row-oriented is better because it will access fewer blocks. Both have the same index usage (if required for sorting); however, column-oriented will access N blocks if N pieces of metadata are required (see stat), whereas row-oriented will usually only need one block to get all of its metadata. As you list a larger portion of the directory this improves and switches to column-oriented’s favour (assuming that you cache the blocks that you read).

Conclusion

It isn’t completely clear. The biggest problem with column-oriented is the “wide stat”, which POSIX APIs assume to be efficient; a lot of software does have POSIX assumptions and will probably continue to for the foreseeable future. On the other hand, consumers of HTTP APIs are getting used to selecting which fields they wish to have returned.

One interesting thing about column-oriented is that, because metadata is separated into different blocks by type, two directories that differ in only one kind of metadata (for example different last-modified times) will be very similar: only that column and the root node will differ. This means that adding, removing, or changing one type of metadata will not require a completely new set of blocks. On the other hand, adding or removing files will affect more blocks.

Either way I think it is clear that we should consider a solution for more metadata. Is there somewhere more targeted that we can discuss this?