I released an open source library for telemetry encoding and compression on IPFS.
It transforms an arbitrary array of JSON values into a columnar representation, compresses the columns, and stores the result on IPFS.
For telemetry data the compression can be very good, much better than just running gzip over the JSON array. Since there seem to be several people using IPFS for storing sensor data or other time-series data, I thought this might be useful.
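To make the columnar idea concrete, here is a rough sketch in Go. This is not the library's code or API, just an illustration: the `Sample` struct and its field names are made up, and gzip per column stands in for whatever the library actually uses. It only shows why transposing row-oriented telemetry into per-field columns tends to compress better than gzipping the JSON array as a whole.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/json"
	"fmt"
)

// A toy telemetry sample; the field names are made up for illustration.
type Sample struct {
	Time  int64   `json:"time"`
	Temp  float64 `json:"temp"`
	Humid float64 `json:"humid"`
}

// gzipSize returns the gzip-compressed size of b.
func gzipSize(b []byte) int {
	var buf bytes.Buffer
	w := gzip.NewWriter(&buf)
	w.Write(b)
	w.Close()
	return buf.Len()
}

func main() {
	// Slowly changing telemetry is where columnar layouts tend to win:
	// each column is highly self-similar and compresses very well.
	samples := make([]Sample, 1000)
	for i := range samples {
		samples[i] = Sample{
			Time:  1500000000 + int64(i)*60,
			Temp:  20.0 + float64(i%10)*0.1,
			Humid: 40.0 + float64(i%5)*0.2,
		}
	}

	// Row-oriented: the plain JSON array, gzipped as a whole.
	rows, _ := json.Marshal(samples)

	// Column-oriented: transpose into one array per field,
	// then compress each column separately.
	times := make([]int64, len(samples))
	temps := make([]float64, len(samples))
	humids := make([]float64, len(samples))
	for i, s := range samples {
		times[i], temps[i], humids[i] = s.Time, s.Temp, s.Humid
	}
	colTime, _ := json.Marshal(times)
	colTemp, _ := json.Marshal(temps)
	colHumid, _ := json.Marshal(humids)

	columnar := gzipSize(colTime) + gzipSize(colTemp) + gzipSize(colHumid)
	fmt.Printf("gzipped JSON array:      %d bytes\n", gzipSize(rows))
	fmt.Printf("gzipped per-column JSON: %d bytes\n", columnar)
}
```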
Here is a blog post describing how it works in detail.
We will definitely check it out. We have lots of small blocks in the applications we are developing for Actyx.
By the way: we have been bitten by the fact that you can create a block > 4 MB but not bitswap it (see "Nitpick: Maximum block size is poorly defined" · Issue #4473 · ipfs/kubo · GitHub). This led to a production issue this week. This should be better documented, or preferably it should not be possible at all to create a block that cannot be sent over the network.
One thing I dislike about compressing data before storing it as dag objects is that the content becomes an opaque blob and is no longer a meaningful IPLD object. So ideally IPFS should transparently compress data with a fast compression algorithm such as zstd before storing it, and optionally send it over the wire in the same compressed form. If you hash before compressing, this can be completely transparent.
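Here is a minimal sketch of what I mean by "hash before compressing", using only the Go standard library. gzip stands in for zstd, a raw sha256 digest stands in for a real CID/multihash, and `putBlock`/`getBlock` are hypothetical helpers over an in-memory map, not any existing IPFS API. The point is just that the address is derived from the uncompressed bytes, so callers never notice the compression.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"crypto/sha256"
	"fmt"
	"io"
)

// putBlock hashes the *uncompressed* bytes (so the address never changes),
// but stores only the compressed form, which could also be shipped over the
// wire as-is. gzip stands in for zstd; a real implementation would use a
// multihash/CID rather than a raw sha256 digest.
func putBlock(store map[[32]byte][]byte, data []byte) [32]byte {
	key := sha256.Sum256(data) // hash before compressing
	var buf bytes.Buffer
	w := gzip.NewWriter(&buf)
	w.Write(data)
	w.Close()
	store[key] = buf.Bytes() // only the compressed form is kept
	return key
}

// getBlock decompresses on the way out, so callers only ever see the
// original bytes and the compression stays completely transparent.
func getBlock(store map[[32]byte][]byte, key [32]byte) ([]byte, error) {
	r, err := gzip.NewReader(bytes.NewReader(store[key]))
	if err != nil {
		return nil, err
	}
	defer r.Close()
	return io.ReadAll(r)
}

func main() {
	store := map[[32]byte][]byte{}
	key := putBlock(store, []byte(`{"temp":[20.1,20.1,20.2,20.2]}`))
	data, _ := getBlock(store, key)
	fmt.Printf("stored %d bytes, got back %q\n", len(store[key]), data)
}
```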
You could create another IPLD Format which extracts the object (kind of like another view on the blob). I’m on the JS side of IPLD, so I don’t know how easy that would be in Go.
Could you perhaps say more about the production issue you ran into? I would be curious about the details, since I suspect I will run into this issue in the next month.
I read somewhere that Badger has high memory usage. Since we are running on low-power, low-memory ARM edge devices, this might be a problem. However, I will roll it out on a few cloud nodes and a few developer devices and see how it goes. Thankfully, we have infrastructure to painlessly roll out new IPFS versions.
I think it is best to stick with the defaults. We are using a pretty niche (for now) technology, so we at least want to stick to the settings other people are using. We have just adjusted our chunking algorithm to never exceed 4 MB.
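For illustration, the size guard boils down to something like the sketch below. This is not our actual chunker (a real one would be content-defined and the exact limit constant is my assumption), it just shows the shape of "never emit a block larger than the bitswap limit".

```go
package main

import "fmt"

// maxBlockSize mirrors the bitswap limit discussed above; the exact
// constant to use is an assumption on my part.
const maxBlockSize = 4 << 20 // 4 MiB

// chunk splits data into pieces that never exceed maxBlockSize, so every
// resulting block stays small enough to be exchanged over bitswap.
// A real chunker would be content-defined; this is only the size guard.
func chunk(data []byte) [][]byte {
	var out [][]byte
	for len(data) > 0 {
		n := len(data)
		if n > maxBlockSize {
			n = maxBlockSize
		}
		out = append(out, data[:n])
		data = data[n:]
	}
	return out
}

func main() {
	blob := make([]byte, 10<<20) // 10 MiB of payload
	for i, c := range chunk(blob) {
		fmt.Printf("chunk %d: %d bytes\n", i, len(c))
	}
}
```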
It was exceedingly simple. We generate data on multiple devices. Due to some strange circumstances and an application-level bug, one of the devices created an IPFS dag node that was >4 MB, and the other devices were not able to fetch that hash.
So the whole system got stuck, and it took me a while to figure out what was going on.
Awesome, thank you for the detailed response! Ah yeah, in that situation it would definitely make sense to stick with the defaults. If I have time, I may try some testing with blocks larger than 4 MB on a private network at some point.
Yes, I know that it is converted into CBOR before hashing, so the canonical representation that is used for hashing is CBOR. But as a mere user of the IPFS API, I don’t really have to care. Which is nice. And it should be exactly the same for transparent compression…
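To spell out what I mean by "converted into CBOR before hashing", here is a rough sketch. It uses github.com/fxamacker/cbor/v2 as a stand-in CBOR encoder and a raw sha256 digest; the real dag-cbor codec has stricter canonical encoding rules and IPFS wraps the digest in a multihash/CID, so treat this only as an illustration of hashing the CBOR bytes rather than the original JSON text. The object contents are made up.

```go
package main

import (
	"crypto/sha256"
	"fmt"

	"github.com/fxamacker/cbor/v2" // stand-in CBOR library, not the dag-cbor codec
)

func main() {
	// Some arbitrary JSON-like object, as you would hand to the dag API.
	obj := map[string]interface{}{
		"device": "sensor-42",
		"temps":  []float64{20.1, 20.1, 20.2},
	}

	// Encode to CBOR first; this plays the role of the canonical
	// representation used for hashing.
	enc, err := cbor.Marshal(obj)
	if err != nil {
		panic(err)
	}

	// The hash (and hence the address) is computed over the CBOR bytes,
	// not over the original JSON text.
	digest := sha256.Sum256(enc)
	fmt.Printf("cbor: %d bytes, sha256: %x\n", len(enc), digest)
}
```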