Back in November 2017 I started a thread on the topic of addressing large scale (genomic) data with IPFS: Addressing petabytes of genetic data with IPFS
The goal was to provide access to this data using IPFS content addresses. This was partially achieved using Filestore and a mounted FTP directory. However, this solution had an obvious network latency bottleneck.
In this thread I would like to discuss a more general solution. Many institutions provide open access to very large scale static datasets. In the vast majority of cases, this data is only accessible via location addresses. Convincing each institute to manage IPFS instances with Filestore extensions would be a costly endeavor. Similarly, incentivizing and organising peers to mirror for a significant duration is a difficult challenge.
As an alternative, peers could work together to curate a bidirectional mapping of IPFS content addresses to location addresses. This mapping could be consulted to find alternative means for retrieving data. In the event that data becomes absent from the IPFS network, a location address could be queried.
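To make the idea more concrete, here is a minimal sketch of what such a bidirectional map could look like in memory. The class and method names are hypothetical and chosen only for illustration; they are not taken from any existing tool, and the addresses in the usage lines are placeholders.

```python
from collections import defaultdict


class BidirectionalMap:
    """Hypothetical in-memory map between content and location addresses."""

    def __init__(self):
        # One content address may be reachable at many locations.
        self._locations_by_content = defaultdict(set)
        # Each location is assumed to serve exactly one piece of content.
        self._content_by_location = {}

    def add(self, content_address, location_address):
        # If the location was previously mapped to other content, move it.
        previous = self._content_by_location.get(location_address)
        if previous is not None and previous != content_address:
            self._locations_by_content[previous].discard(location_address)
        self._content_by_location[location_address] = content_address
        self._locations_by_content[content_address].add(location_address)

    def locations_for(self, content_address):
        # Return a copy so callers cannot mutate the internal state.
        return set(self._locations_by_content.get(content_address, ()))

    def content_for(self, location_address):
        return self._content_by_location.get(location_address)


links = BidirectionalMap()
links.add("QmExampleHash", "https://example.org/fileA")
links.add("QmExampleHash", "ftp://example.org/pub/fileA")
```

Looking up in either direction is then a dictionary access, which is what would let a peer check for an IPFS address before downloading from a location, or find a location when the data is absent from the network.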
The mapping would require updating. This could be achieved by a group of peers working together. For instance, using a location address and hashing the retrieved data could validate a mapping entry.
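The validation step could be sketched roughly as follows. Note the assumptions: a real IPFS content address depends on chunking and DAG layout and is not a plain SHA-256 of the file bytes, so `digest_of` below is only a stand-in, and `fetch` is a hypothetical caller-supplied function that returns the bytes found at a location address.

```python
import hashlib


def digest_of(data: bytes) -> str:
    # Stand-in for an IPFS content address: a plain SHA-256 hex digest.
    return hashlib.sha256(data).hexdigest()


def validate_entry(expected_digest, location_address, fetch):
    """Return True if the bytes at location_address hash to expected_digest.

    fetch is a caller-supplied callable (location_address -> bytes) that
    raises OSError when the location cannot be reached.
    """
    try:
        data = fetch(location_address)
    except OSError:
        # Unreachable location: the entry cannot be confirmed right now.
        return False
    return digest_of(data) == expected_digest
```

A peer could run a check like this periodically over a subset of entries and drop or flag the ones that no longer validate.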
I have begun working on a project for managing and building a bidirectional map:
I believe that the ability to access and address data using IPFS content addresses would be of great benefit to the scientific community, irrespective of the backend mechanisms which actually retrieve the data.
I hope that this community can cast a critical eye over my proposal. I’m quite happy to abandon it completely or change it significantly if I’m not on the right track. I would welcome all feedback.
Thank you for your time.
(The IPFS URL store might be related: https://github.com/ipfs/go-ipfs/pull/4896)
How is this fundamentally different from filestore + FTP mount? Of course, a custom ipfs storage module that spoke FTP directly would be nicer than an FTP mount, but the principle is the same. You maintain a mapping (hash -> FTP location) and advertise it to your peers, who can then get the content from you over bitswap.
Thank you for your question.
This should be a more general solution than simply filestore + FTP mount. Some data providers don’t provide FTP endpoints. They may, for instance, provide HTTP endpoints only.
The filestore + FTP mount solution is costly to operate. The seeding IPFS peer must first fetch the data over the FTP mount before relaying it via bitswap. Caching is not a practical workaround when very large scale datasets are concerned. However, if the mapping itself were available, each peer could simply download using the FTP location address directly if the data was absent from the IPFS network.
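As a rough sketch of that fallback, assuming the peer already holds the mapped locations for an address: both fetchers below are hypothetical callables supplied by the caller, and a plain SHA-256 digest stands in for a real IPFS content address.

```python
import hashlib


def retrieve(digest, locations, fetch_from_network, fetch_from_location):
    """Try the peer-to-peer network first, then each mapped location address.

    fetch_from_network(digest) returns bytes, or None if no peer has the data.
    fetch_from_location(location) returns bytes or raises OSError.
    Returns a (data, source) pair, where source is "network" or the location.
    """
    data = fetch_from_network(digest)
    if data is not None:
        return data, "network"
    for location in locations:
        try:
            data = fetch_from_location(location)
        except OSError:
            continue  # this location is unreachable, try the next one
        # Only accept bytes that actually hash to the requested address.
        if hashlib.sha256(data).hexdigest() == digest:
            return data, location
    raise LookupError(f"no source could provide {digest}")
```

The point of the sketch is that the seeding peer is no longer in the data path: the requesting peer goes to the location address itself and verifies the result locally.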
filestore + FTP mount is effectively a mapping which is difficult to transfer and manage across peers. Suppose a peer uses filestore to address every file in an FTP mount. This would require downloading every file accessible through the FTP mount and computing each IPFS content address. This computation could be very costly. As far as I understand, transferring this mapping to other peers is not currently possible. Nor is building the mapping collaboratively.
Furthermore, a peer which does not have access to a bidirectional mapping solution will have to download the data from the conventional location address as a precursor to possibly discovering that the data was available on the IPFS network all along.
I hope that I’ve addressed your question. If I haven’t, please let me know. Looking forward to any further thoughts that you may have.
The cost of downloading-then-seeding has to be paid at some point. However, if there is enough load for that cost to become an issue, all the people who have fetched blocks and have a copy will help seed. This is a main strength of IPFS.
Is one of your main concerns the cost of building the mapping to begin with?
I am too new to IPFS to add technical value to this discussion, but I wanted to drop in and say that I think this use-case is well laid out and that I feel you have done a good job examining the needs of research data providers.
Thank you for sharing your work on this.
I have been thinking of applying a similar approach for satellite data. I am still too early with testing things out to explore this fully right now, but I have bookmarked IPSL so I can examine it when I can get to it.
Thank you for your words of encouragement @7yl4r. I am constantly concerned about unwittingly wasting my time. Let me know if you think I can help you in any way with regards to this sort of project.
@singpolyma Thank you once more for your response. Building a map can certainly be expensive for certain data stores (e.g. petabyte scale research datasets). However, the map can be built lazily. And this workload can be spread across many peers. Note that I don’t envisage a single global map.
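To illustrate what I mean by lazy building and shared workload, here is a small sketch. The class is hypothetical, `fetch` is a caller-supplied function returning the bytes at a location, and a SHA-256 digest again stands in for a real IPFS content address.

```python
import hashlib


class LazyMap:
    """Hypothetical map that hashes a location only when first asked about it."""

    def __init__(self, fetch):
        self._fetch = fetch            # callable: location -> bytes
        self.digest_by_location = {}   # filled in on demand

    def digest_for(self, location):
        # Download and hash only on the first request for this location.
        if location not in self.digest_by_location:
            data = self._fetch(location)
            self.digest_by_location[location] = hashlib.sha256(data).hexdigest()
        return self.digest_by_location[location]

    def merge(self, other_entries):
        """Adopt entries another peer has computed; keep our own on conflict."""
        for location, digest in other_entries.items():
            self.digest_by_location.setdefault(location, digest)
```

Each peer only pays for the files it actually touches, and merging another peer's entries avoids repeating their work. In practice merged entries would still need to be validated before being trusted.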
Large scale scientific research data is currently addressed by a myriad of complex systems. Accession numbers, data set IDs, and other centrally controlled non-cryptographic complex pieces of data are all bundled together to traverse vast datasets. From what I’ve seen, each institution has created its own unique and complex organizational system. The complexities from these organizational methods spill out into other interacting systems.
IPFS could provide a vastly simpler and homogenized interface for accessing data held by these institutions. But this does not require mirroring the data. It doesn’t even require institutional participation. It should only require a bidirectional map.
I’ve spent a bit of time thinking about this and can say that I like the idea, and will most likely be implementing it in my own project, although porting it to golang first! My particular use case will be for temporary “short urls” to content people upload through my system. I like the idea quite a bit and think that it has many uses.
This is where I’ll be putting the golang port:
I’d be interested in exploring this topic more. Another possible consideration would be to explore where this mapping fits with persistent, resolvable identifiers like DOIs/Handles which exist in a kind of liminal space between location and content addressing. Although, from a long-term (measured in decades and centuries) preservation and persistent identification perspective, they may well prove to outlast pure content-addressing due to the likelihood of hash collisions etc.
As part of the EU-funded Freya project, I know that DataCite are actively looking at how to map data DOIs to actual content rather than http landing pages, so another possible avenue there.
An interesting topic. @rffrancon, I’d definitely join a call etc. on this topic.
@rffrancon : can you help me understand how to make this work across multiple clients?
My attempt to stumble through a use-case:
1. fileA is available on IPFS at hash Qx... and at an HTTPS location
2. on client1: ipsl links add --ipfs="Qx..." --https="usf.edu/fileA" to document the link
3. copy the map hash from ipsl config show on client1 to client2
4. on client2: ipsl links merge $hash_from_client1 to sync client2 with client1
Assuming this is correct, what automated method would you recommend for (3)? For ease of implementation on my end I am thinking of simply keeping the “latest” map hash in my product metadata db. But I suppose any key-value store of file-hash to links-dag-hash shared between clients should work.
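A minimal sketch of the "latest map hash in a shared db" idea, using SQLite purely as a stand-in for the product metadata db. The table name, function names, and the `"links"` key are all hypothetical; nothing here calls ipsl itself.

```python
import sqlite3


def open_store(path=":memory:"):
    # A single table holding the most recent map hash per named map.
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS latest_map ("
        "name TEXT PRIMARY KEY, map_hash TEXT NOT NULL)"
    )
    return db


def publish_map_hash(db, name, map_hash):
    # Overwrite any earlier hash so readers always see the latest one.
    db.execute(
        "INSERT OR REPLACE INTO latest_map (name, map_hash) VALUES (?, ?)",
        (name, map_hash),
    )
    db.commit()


def latest_map_hash(db, name):
    row = db.execute(
        "SELECT map_hash FROM latest_map WHERE name = ?", (name,)
    ).fetchone()
    return row[0] if row else None
```

client1 would call `publish_map_hash` after adding links, and client2 would read the value back with `latest_map_hash` and pass it to its merge command, which replaces the manual copy between clients.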
As of v0.4.17, the urlstore is now available as an experimental feature. This allows URLs to be added to IPFS.
The short version for anyone wanting to test it is to enable it with
ipfs config --json Experimental.UrlstoreEnabled true
and add a file using something like
ipfs urlstore add https://ia800500.us.archive.org/1/items/mma_albert_einstein_pasadena_270713/270713.jpg
The resulting IPFS hash can then be used to retrieve the object at the URL.
For reference, the examples for how to use it came from the sharness test.