IPFS for community-led research

I think there is a great opportunity for libraries to play a role here. Libraries already exist as places where communities hold the data that they care about. In the same way that I can ask my library to add a new book to its collection, we should all be able to ask our libraries to pin content on the library’s IPFS nodes. The model is basically the same for books and IPFS content: patrons nominate content (a book, a dataset), and the library considers the request and decides whether to proceed. If the library decides that the content is appropriate for its collections and fits within its budget, it accessions a copy into its collection. Once the content is in the collection, the library can support access (making sure it’s available), discovery (providing ways to find the content and learn more about it) and preservation (making sure it doesn’t get destroyed).

On a technical level, IPFS lets you reduce the problem of preserving data to this:

  • groups of people decide on a list of hashes corresponding to the content they want to pin – a pinset.
  • those people allocate storage to hold that pinset and pin it on IPFS nodes.
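
To make those two steps concrete, here is a minimal Python sketch of a pinset as a plain set of hashes. It is only an illustration, not how any IPFS tool stores pins: the function names `merge_nominations` and `missing_pins` and the placeholder hash strings are assumptions made up for this example.

```python
def merge_nominations(*nominations):
    """Combine the hashes nominated by each group member into one pinset."""
    pinset = set()
    for nominated in nominations:
        pinset |= set(nominated)
    return pinset


def missing_pins(pinset, already_pinned):
    """Return the hashes in the agreed pinset that a node has not pinned yet."""
    return sorted(set(pinset) - set(already_pinned))


# Placeholder strings stand in for real content hashes.
alice = ["hash-of-dataset-a", "hash-of-dataset-b"]
bob = ["hash-of-dataset-b", "hash-of-dataset-c"]

pinset = merge_nominations(alice, bob)
todo = missing_pins(pinset, already_pinned=["hash-of-dataset-a"])
print(todo)
```

A node operator would then pin each hash in `todo` on their own node. Because the pinset is just a list of hashes, it can be shared, reviewed and versioned like any other small text file.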

For storing the content, you have an abundance of options, such as:

  1. run your own IPFS nodes on your own hardware
  2. use Filecoin
  3. put it on a cloud service or colocated servers
  4. form a reciprocal arrangement to trade backups with other groups (like LOCKSS, DPN, etc.)
  5. mix and match
  6. etc, etc

You can optionally use ipfs-cluster to coordinate a network of participating peers who share the burden of storing an evolving set of pinned content.
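
As a rough picture of what "sharing the burden" can mean, the toy Python sketch below spreads a pinset across peers so that every hash is held by a fixed number of them. This is not ipfs-cluster's actual allocation logic. The `allocate` function, the `replicas` parameter and the peer names are hypothetical, chosen only to illustrate the idea.

```python
def allocate(pinset, peers, replicas=2):
    """Assign each hash to `replicas` distinct peers, least-loaded first."""
    if replicas > len(peers):
        raise ValueError("not enough peers for the requested number of replicas")
    load = {peer: 0 for peer in peers}
    plan = {}
    for content_hash in sorted(pinset):
        # Pick the peers currently holding the fewest pins (ties broken by name).
        chosen = sorted(peers, key=lambda peer: (load[peer], peer))[:replicas]
        for peer in chosen:
            load[peer] += 1
        plan[content_hash] = chosen
    return plan


# Hypothetical peers and placeholder hashes.
plan = allocate({"hash-a", "hash-b", "hash-c"}, ["library-1", "library-2", "library-3"])
for content_hash, holders in plan.items():
    print(content_hash, "->", holders)
```

When the pinset changes or a peer leaves, the group can rerun the allocation. The links to the content are unaffected because they depend only on the hashes.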

A key benefit of IPFS is that you can move your data around, rebalancing and changing storage strategies as you see fit, without changing the links that point to the content. Whether a researcher is serving the data directly from her laptop or from a big beefy server in some data center, the link stays the same. This means the location of the data is only a detail that affects things like availability and protection from data loss. It doesn’t affect the way people link to the data, cite it, etc.

That benefit also extends to the fact that I can pull data onto my own machine if I want to – I don’t have to rely on a faraway server to give me access to the content and I don’t have to rely on someone else to keep the data around if I don’t trust them to keep it safe. Again, in that context the links don’t change. If the data stays the same, the link will stay the same regardless of whether it’s on a server run by the EPA or on an external hard drive in someone’s home.
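
Here is a small Python sketch of why the link doesn't depend on location. It is a simplification under stated assumptions: real IPFS links are content identifiers (CIDs) built with more machinery than this, so a plain SHA-256 of the bytes stands in for them here. The `content_link` function and the `Store` class are hypothetical names for illustration.

```python
import hashlib


def content_link(data: bytes) -> str:
    """Derive a link from the content alone (stand-in for a real IPFS CID)."""
    return "sha256-" + hashlib.sha256(data).hexdigest()


class Store:
    """A toy storage location: a laptop, a server, an external hard drive."""

    def __init__(self, location: str):
        self.location = location
        self.blocks = {}

    def add(self, data: bytes) -> str:
        link = content_link(data)
        self.blocks[link] = data
        return link

    def get(self, link: str) -> bytes:
        data = self.blocks[link]
        # Anyone can check that what they received matches the link.
        if content_link(data) != link:
            raise ValueError("content does not match its link")
        return data


dataset = b"station,year,reading\nA1,2016,7.2\n"
laptop = Store("researcher's laptop")
server = Store("data center")

link_from_laptop = laptop.add(dataset)
link_from_server = server.add(dataset)
print(link_from_laptop == link_from_server)  # same bytes, same link
```

Because the link is computed from the data itself, a copy fetched from any holder can be verified against the link, which is what makes it safe to pull the data from whoever happens to have it.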
