Feedback appreciated: Creating an IPLD structure that supports high-throughput, parallelised write operations

Hi, excuse me if this is not the right place to ask, I’m new here.

At WeatherXM we need a way to store weather data on IPFS. Our problem is that we already have thousands of datapoints generated every minute, and this number will probably go up 10x in the next year. To solve this, we created a structure that looks like interconnected (block)chains, has some interesting properties, and is easy to implement using IPLD.

We call the structure ccDAG (not sure if it’s the ideal name), and I’ve put together an article to describe it:

Maybe the concept is not new and others are already using it; if so, please let me know. Or maybe there is a fundamental flaw in the idea, in which case I would definitely like to know 🙂 before we start working on the implementation. In any case, I would appreciate any feedback!

I haven’t thought enough about ccDAGs to give an opinion, just an observation:
The way we usually solve this is by having some consensus algorithm elect someone who is responsible for updating the new head. See Filecoin, Ethereum, Bitcoin, … for examples.

Sure. But then, your bandwidth is one: one node can update the graph at a time. What I shared allows us to update the graph in parallel, so if you have 10 nodes, that’s 10x throughput.
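
A rough sketch of the parallel-write argument, not the ccDAG design itself (that is described in the article): each writer appends to its own hash-linked chain, so no writer has to wait for another. The `Writer` class, the field names, and the SHA-256 stand-in for real CIDs are all assumptions made for illustration.

```python
import hashlib
import json


def link(obj):
    """Stand-in for a CID: SHA-256 over deterministic JSON (assumption)."""
    data = json.dumps(obj, sort_keys=True).encode()
    return hashlib.sha256(data).hexdigest()


class Writer:
    """Hypothetical writer that owns one hash-linked chain of blocks."""

    def __init__(self, writer_id):
        self.writer_id = writer_id
        self.head = None   # link to the newest block, None while empty
        self.blocks = {}   # link -> block, a toy in-memory block store

    def append(self, datapoints):
        # Each block points at the previous head of this writer only,
        # so appending needs no coordination with other writers.
        block = {"writer": self.writer_id, "prev": self.head, "data": datapoints}
        self.head = link(block)
        self.blocks[self.head] = block
        return self.head


writers = [Writer(i) for i in range(10)]
for w in writers:
    w.append([{"station": f"s{w.writer_id}", "temp": 20.0}])

heads = [w.head for w in writers]  # ten independent heads
```

Ten writers end up with ten separate heads. Tying those heads back together into one graph is where extra links and convergence overhead come in.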

You can have (10 - O(log_Y(10))) * X if you use a Merkle tree. You will also have overhead from the convergence, which adds extra links.

I don’t even think the updates are what is expensive. If your code is optimized, all you are doing is reading memory, doing crypto (which modern CPUs are fast at), and writing some memory to disk; some of those steps can be parallelized.

One thing I wondered about: will this be your read-side data structure too? Because AFAICT I would need to go through the history to do geo queries.

Filecoin, for example, gives you a single root at the end; the history is accessible somewhere within that root.
But it also exposes the most up-to-date snapshot of the state, and the history lists what the states were at each earlier point. This means you can easily go back and run a query against a past state without having to traverse and resolve across widely different points in time.
Note: almost all of the state is aliased. That means if my account information hasn’t changed in the last few epochs, all of those epochs point to the same account info CID (or whatever it’s called).
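
To illustrate the aliasing point, here is a minimal sketch of content addressing, assuming a toy in-memory store and SHA-256 over canonical JSON as a stand-in for real CIDs. This is not Filecoin's actual state layout; `put`, `make_epoch`, and the account fields are hypothetical.

```python
import hashlib
import json

store = {}  # toy content-addressed store: cid -> object


def put(obj):
    """Store an object under the hash of its canonical JSON form."""
    data = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode()
    cid = hashlib.sha256(data).hexdigest()
    store[cid] = obj  # identical content always lands on the same key
    return cid


def make_epoch(prev_epoch_cid, accounts):
    """Snapshot the full state and link back to the previous epoch."""
    state = {name: put(info) for name, info in accounts.items()}
    return put({"prev": prev_epoch_cid, "state": state})


e1 = make_epoch(None, {"alice": {"balance": 10}, "bob": {"balance": 5}})
e2 = make_epoch(e1, {"alice": {"balance": 10}, "bob": {"balance": 7}})

# alice did not change, so both epochs reference the same stored object
same_alice = store[e1]["state"]["alice"] == store[e2]["state"]["alice"]
```

Each epoch carries a full snapshot, yet only the changed account costs a new block, because unchanged content hashes to the same identifier.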

Interesting. So what you propose is to have N nodes receiving data in parallel, buffer the data, and use an eligibility algorithm to decide who writes the next head. Right?

On the other question, yes, this will be the read side too. It’s practically impossible to optimise for arbitrary queries (by location, by station ID, by measurement ranges, etc.), so we assume that one consumer (or maybe more, depending on the use case) will read the data and index it according to their needs. It could be a huge RDBMS for someone who provides a generic API to our data, or a simple flat-file DB for someone who tracks the data of a single station and throws away everything else.
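
As a sketch of what such a reader might do, assuming a simple log where each entry links to the previous one: walk back from the head once and build whatever index the use case needs. The log layout, the keys (`c1`, `c2`, `c3`), and the field names are hypothetical.

```python
from collections import defaultdict

# Hypothetical log: key -> entry, each entry links to the previous one
log = {
    "c1": {"prev": None, "station": "A", "ts": 1, "temp": 20.1},
    "c2": {"prev": "c1", "station": "B", "ts": 2, "temp": 18.4},
    "c3": {"prev": "c2", "station": "A", "ts": 3, "temp": 21.5},
}


def walk(log, head):
    """Yield entries from the head back to the first entry."""
    key = head
    while key is not None:
        entry = log[key]
        yield entry
        key = entry["prev"]


def index_by_station(log, head):
    """Build a station -> entries index, oldest entry first."""
    index = defaultdict(list)
    for entry in walk(log, head):
        index[entry["station"]].append(entry)
    for entries in index.values():
        entries.sort(key=lambda e: e["ts"])
    return dict(index)


index = index_by_station(log, "c3")
station_a = index["A"]  # a single-station consumer keeps only this list
```

A generic API provider would load every entry into its own database instead; a single-station tracker keeps one list and discards the rest.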

Keep in mind that there are no “state” updates as far as we are concerned. Every piece of data is important. So, the last state of station X is no less important (in general) than any other one. This is more like a log than a key-value store.