Hey everyone, I have a long-form (re)architecture question and would love some feedback on whether it's even the right approach.
I’m from the dClimate team, where we’re currently working on making climate data available and easily accessible through IPFS. We’re an offshoot of Arbol (incubated by the Arbol team, with some of the same cofounders), which actually has a case study on the IPFS docs page here: https://docs.ipfs.io/concepts/case-study-arbol/. Our goal is to increase transparency in the climate data space so that everyone from construction companies, logistics organizations, and parametric insurers to local and national governments has as much information as possible to make the best decisions.
As we have continued to scale the datasets on the Arbol side (into the dozens of terabytes), we came to the realization that a climate-specific DAO stewarded by domain experts would be the best way to produce and maintain high-quality data sources. With this expansion we also began to investigate what scaling our infrastructure would look like compared to our initial versions, and I have some thoughts on a re-architecture that I would love your feedback on.
Government climate datasets are not only quite fragmented across dozens of FTP servers, but data points are also sometimes updated after they are posted, so older data points often need to be revised, which makes parametric insurance a nightmare to deal with if you don’t have a data trail. We’re leveraging IPFS to create that data trail so that anyone can publicly verify and confirm for themselves what the data previously was and what it is now, while also making it easier to query these datasets spatiotemporally.
Problem:
The way we currently do this is by uploading the initial dataset of already-available historical data to IPFS. Then, upon each new update on the government servers, we add a new item to IPFS containing the new and/or revised data, with a reference to the previous “genesis” IPFS CID for that particular dataset. On the next update from the government sources we repeat this process, this time referencing the last CID, and so on. This creates a linked list of all new data along with all the previous data updates. To make it easier to query this IPFS “data structure” we’ve created a Python client that anyone can run locally or on a server. The issue is that this client must traverse the entire linked list and reconstitute the entire dataset in memory (via a reduce), even for a small slice of data in time. As we intend to create a climate data infrastructure that anyone can use (even hobbyists), the growing space and memory demands of this approach (as new data is added) create constraints that will only worsen over time, crowding out the very people we want to empower.
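To make the cost concrete, here is a minimal sketch of what that traversal looks like today. `dag_get(cid)` is a hypothetical stand-in for whatever call the client uses to fetch an IPLD node, and the field names (`data`, `previous`) are illustrative rather than our actual schema:

```python
def dag_get(cid):
    """Fetch an IPLD node by CID (stand-in for a real IPFS client call)."""
    raise NotImplementedError


def load_full_dataset(head_cid):
    """Walk the linked list from the newest update back to genesis,
    merging every update into one dict keyed by timestamp.

    Cost grows with the total number of updates ever published,
    even if the caller only needs a few days of data.
    """
    merged = {}
    cid = head_cid
    while cid is not None:
        node = dag_get(cid)
        # We walk newest -> oldest, so older values must not overwrite
        # newer revisions; setdefault only fills in gaps.
        for timestamp, value in node["data"].items():
            merged.setdefault(timestamp, value)
        cid = node.get("previous")  # link to the prior update, None at genesis
    return merged
```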
Proposed Solution:
As a result of these constraints, I looked a bit more into IPFS and what data structures/models would work well within the IPFS ecosystem. I wondered if combining the concepts of Merkle DAGs with IPLD schemas would be a more scalable approach (e.g. the git model). Instead of the current linked list, datasets would be preprocessed to fit into predetermined “buckets” (hourly, daily, weekly, monthly) depending on the specific dataset, and the root node of a Merkle DAG would reference these bucketed datasets (each Merkle DAG root would be an IPLD schema). If any particular day were to change, the new root node would keep referencing all the other days as-is and point only the changed day to its new bucket, and so on. The day buckets themselves could also carry some metadata (if so desired) so that a user could traverse through the versions of a day. So you get a data trail not only in the root nodes of the Merkle DAGs (which reference each other back to genesis) but also on the data-bucket side (option B on the diagram). I’ve attached a diagram for comprehensibility, where the top structure is the current implementation and the bottom is the newly proposed approach.
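A rough sketch of what publishing a revision might look like under this model, assuming daily buckets and hypothetical `dag_get(cid)`/`dag_put(obj)` helpers for reading and writing IPLD nodes. The field names (`buckets`, `previous_root`, `previous_version`) are placeholders, not a finished IPLD schema:

```python
def publish_revision(prev_root_cid, day, new_bucket_obj, dag_get, dag_put):
    """Create a new root that reuses every unchanged daily bucket CID and
    swaps in a new CID only for the revised day. The replaced bucket CID is
    kept as per-bucket metadata so the revision history of that day stays
    walkable (option B in the diagram).
    """
    prev_root = dag_get(prev_root_cid)
    buckets = dict(prev_root["buckets"])          # e.g. {"2021-06-01": <cid>, ...}

    old_bucket_cid = buckets.get(day)
    new_bucket_obj["previous_version"] = old_bucket_cid   # bucket-level data trail
    buckets[day] = dag_put(new_bucket_obj)

    new_root = {
        "buckets": buckets,               # unchanged days share the same CIDs
        "previous_root": prev_root_cid,   # root-level trail back to genesis
    }
    return dag_put(new_root)
```

Because unchanged buckets keep their CIDs, each new root only adds the revised bucket plus a small root node, much like a git commit reusing unchanged tree objects.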
I feel like this branching model affords the community the ability to easily create forked (derivative) datasets while also making it trivial to query based on time slices, as you no longer need to reconstitute the entire linked list and can query in parallel for the days you care about.
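For example, a time-slice query would then only need the root plus the buckets it actually cares about, and those fetches can run concurrently. A sketch, reusing the same hypothetical `dag_get` helper as above:

```python
from concurrent.futures import ThreadPoolExecutor


def query_days(root_cid, days, dag_get):
    """Resolve the root once, then fetch just the requested daily buckets
    in parallel instead of reconstituting the whole dataset."""
    root = dag_get(root_cid)
    wanted = {day: root["buckets"][day] for day in days if day in root["buckets"]}
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(dag_get, wanted.values()))
    return dict(zip(wanted.keys(), results))
```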
With that said, I was wondering if there are any constraints at the protocol level, such as folder size (whether total size or number of files), or any particular resources that could help us avoid pitfalls if this approach is deemed worth exploring. Our ultimate goal is also to replicate this data onto Filecoin to create a canonical, permanent climate data “ledger”. If anything above does not make sense, please let me know!