We have a use case where we want to store a lot of data on very modest devices, for various reasons. We make sure to create the data in sufficiently chunky pieces. Our data is JSON stored as dag objects, with a lot of potential for compression (your typical JSON data, I guess).
I am tempted to just compress the data before storing it and store only the compressed blob, but that would lose many of the advantages of IPLD dag objects, namely the addressability of their content.
I think a much better solution would be to have compression at the file-store and transport layers. E.g. compress individual blocks in the file store once they exceed a certain size. Then communication between nodes via bitswap etc. could use this compressed data as well, provided that both sides understand the compression. Similar to content-encoding negotiation in HTTP.
The hash of the data would be computed before compression, so that the hash remains stable when switching compression formats and options.
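A minimal sketch of that idea, using stdlib `hashlib` and `zlib` as a stand-in for zstd and a plain SHA-256 hex digest as a stand-in for a real CID: because the hash is taken over the uncompressed bytes, every compression level (or a different algorithm entirely) yields the same identifier.

```python
import hashlib
import zlib  # stand-in for zstd; a real node would use a zstd binding

# Hypothetical telemetry block, highly repetitive like typical JSON data.
raw = b'{"device":"sensor-1","temp":21.5}' * 100

# The identifier is derived from the *uncompressed* bytes...
digest = hashlib.sha256(raw).hexdigest()

# ...so switching compression formats or options never changes it.
for level in (1, 6, 9):
    stored = zlib.compress(raw, level)
    assert hashlib.sha256(zlib.decompress(stored)).hexdigest() == digest
```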
I think for the data that is typically stored in IPFS, this could improve storage efficiency and reduce network bandwidth usage by a large factor. There would probably have to be a noop compression option for data that is already compressed, such as images or videos. But it should be pretty easy and cheap to detect whether compression provides a benefit when adding dag objects.
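One cheap way to do that detection, sketched with `zlib` standing in for zstd (the function name and thresholds are hypothetical): compress a small prefix of the block at the fastest level and fall back to the noop encoding if the ratio isn't worth it.

```python
import os
import zlib  # stand-in for zstd

def worth_compressing(block: bytes, sample_size: int = 4096,
                      threshold: float = 0.9) -> bool:
    """Hypothetical heuristic: compress a small sample at a fast level;
    if it barely shrinks, the whole block is probably already compressed."""
    sample = block[:sample_size]
    if not sample:
        return False
    ratio = len(zlib.compress(sample, 1)) / len(sample)
    return ratio < threshold

print(worth_compressing(b'{"temp": 21.5, "unit": "C"}' * 200))  # repetitive JSON -> True
print(worth_compressing(os.urandom(4096)))  # incompressible, like a JPEG -> False
```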
Is there any work ongoing on this, or do I have to resort to compressing individual blocks myself and lose the benefits of interplanetary linked data?
Yes, we are storing json as dag objects. However, I think for our kind of data (basically pretty regular telemetry / events) there is definitely a factor of 10 or more to be had using intelligent compression. And I don’t think this is very uncommon.
So I guess the best short-term solution would be to encode the data using CBOR / IPLD and then compress it before storing/hashing it. Longer term it would be great if IPFS would deal with that last step, but I fully understand that there are other, more pressing issues (like making pubsub and IPNS fast and production-ready).
We currently solve it by storing data as CBOR and then compressing it with zstd. So the blocks are still dag-cbor, but they contain just an encoding + compression algorithm identifier and a blob. This is efficient since CBOR has a native byte-string type.
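A sketch of that envelope shape, with stdlib `json`/`zlib` standing in for dag-cbor/zstd (the `wrap`/`unwrap` helpers and field names are made up for illustration): the outer node carries only the identifiers and the compressed payload.

```python
import json
import zlib  # stand-in for zstd

def wrap(node: dict) -> dict:
    """Hypothetical envelope: a small outer node holding the inner encoding
    and compression identifiers plus the compressed payload blob."""
    raw = json.dumps(node, separators=(",", ":")).encode()
    return {"enc": "json", "comp": "zlib", "data": zlib.compress(raw)}

def unwrap(envelope: dict) -> dict:
    assert envelope["enc"] == "json" and envelope["comp"] == "zlib"
    return json.loads(zlib.decompress(envelope["data"]))

# Regular telemetry-style data compresses very well in this scheme.
node = {"events": [{"t": i, "v": 21.5} for i in range(1000)]}
assert unwrap(wrap(node)) == node
```

The downside mentioned below applies directly: the payload is opaque bytes, so links inside it are invisible to IPLD traversal.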
CBOR, being a schemaless / self-describing format, is terribly inefficient, albeit a bit better than JSON. As a result it compresses incredibly well: we get a factor of 20 for large chunks.
But the downside is that this is now opaque data that e.g. can no longer contain links that can be traversed using IPLD, which is really sad.
Technically, I think you could also introduce a compression-datastore wrapper layer that compresses and uncompresses things whenever they are written to the datastore or read out of it. That would be transparent to IPLD etc.
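A toy sketch of such a wrapper, assuming a dict-like backing store and using `zlib` as a stand-in for zstd (class name, one-byte flag, and size cutoff are all hypothetical): callers put and get plain blocks, and the wrapper handles compression transparently, skipping tiny blocks where the overhead dominates.

```python
import zlib  # stand-in for zstd

class CompressingDatastore:
    """Hypothetical wrapper: compresses values on put, decompresses on get,
    so the layers above (IPLD etc.) only ever see plain blocks."""

    def __init__(self, backing: dict, min_size: int = 128):
        self.backing = backing
        self.min_size = min_size  # below this, store uncompressed (noop)

    def put(self, key: str, value: bytes) -> None:
        if len(value) >= self.min_size:
            self.backing[key] = b"\x01" + zlib.compress(value)
        else:
            self.backing[key] = b"\x00" + value  # noop encoding flag

    def get(self, key: str) -> bytes:
        stored = self.backing[key]
        body = stored[1:]
        return zlib.decompress(body) if stored[0] == 1 else body

store = CompressingDatastore({})
block = b'{"temp":21.5,"unit":"C"}' * 50
store.put("cid-1", block)
assert store.get("cid-1") == block
```

Note this only gives on-disk compression; getting transport compression for free, as described below, would still require bitswap to hand off the compressed bytes directly.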
I had this discussion with Volker a year ago. I think it makes sense to compress IPLD specifically, since it is a format that benefits very much from compression, and you get transport compression for free without having to transcode from on-disk compression to transport compression. If you just enable compression for your file system, you will waste a lot of effort trying to compress large assets (images, videos) that are already compressed and will not compress any further.
IPLD already has support for multiple formats (json, cbor and proto, IIRC), so a very quick solution would be to just add e.g. cbor-zstd and be done with it. This is basically what we are doing, except that with our approach we no longer store open IPLD but blobs.
Regarding badger: we have had incredibly bad experiences with badger consuming all available memory and the repos becoming corrupted. The entire team has a visceral hatred for the thing now. We have just reconfigured several Linux boxes in production from badger back to the file-based store because of this.
We are all running the most recent IPFS release, 0.4.22. Maybe this has gotten better since then, but it has gotten to the point where I don't even want to try any more because I have been burned so many times…
I haven't had that experience, but I run on big, stable machines where badger can use as much memory as it sees fit. There have also been fixes on the badger side. On low-memory platforms I do set the config to write the value tables and index to disk rather than keeping them in memory.
The machines where we had these problems were Intel Core i5 boxes as well as cloud nodes on AWS. So not supercomputers, but also not Raspberry Pis. We gave the cloud nodes more memory, but that just seemed to delay the problems a bit. I have to say that our use case is probably more dynamic than most, with blocks regularly being added and then removed again by GC.