Is there an implementation of IPFS that includes erasure coding like Reed-Solomon right now? (I have scanned the issues and the forum, and the answer in previous years was no.)
BTW, I am a senior student and my supervisor hopes I can merge Reed-Solomon into an implementation of IPFS (like Kubo) to solve the problem of single-node cold data storage. Is this achievable?
Not that I know of.
This is not how we use IPFS in the wild today: storing files is voluntary, and nodes don’t randomly start storing other people’s files.
If someone wants to store a file, they store the blocks themselves.
I’m not sure how Reed-Solomon would handle verifying the hashes, or why it would be useful here.
Thanks for your reply!
The reason I want to use Reed-Solomon is that IPFS may not be a suitable storage method in some corner cases. For example, someone uses IPFS to store a file for years, then the hard drive holding the data breaks down. Since no one ever pulled the file, there is no copy anywhere else, so the data is lost forever.
You are correct that the best way to avoid such a scenario is to make a replica locally. However, if IPFS supported erasure coding, creating a replica could be easier.
With erasure coding, users can get fault tolerance for a large amount of data while using less space. If I specify three nodes to store the data, traditional replication requires 300% of the original disk space, but with Reed-Solomon nearly half of that storage can be saved for a similar effect.
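A rough sketch of the arithmetic behind that comparison follows. The parameters are illustrative assumptions rather than anything proposed in this thread: `k` is the number of data shards and `m` the number of parity shards, so any `m` of the `k + m` shards can be lost.

```python
def replication_overhead(copies: int) -> float:
    """Stored bytes per original byte when keeping `copies` full replicas."""
    return float(copies)


def reed_solomon_overhead(k: int, m: int) -> float:
    """Stored bytes per original byte with k data shards and m parity shards."""
    return (k + m) / k


# Three full replicas: 3.0x space, survives the loss of any 2 copies.
three_replicas = replication_overhead(3)

# Hypothetical RS(4, 2): 1.5x space, survives the loss of any 2 of 6 shards.
rs_4_2 = reed_solomon_overhead(4, 2)

# Hypothetical RS(2, 1) on exactly three nodes: 1.5x space,
# but it only survives the loss of 1 node, not 2.
rs_2_1 = reed_solomon_overhead(2, 1)

savings = 1 - rs_4_2 / three_replicas  # fraction of space saved vs. 3 replicas
```

Note that matching the two-failure tolerance of three replicas at 1.5x needs six shards on six failure domains. With only three nodes, the 1.5x layout tolerates a single failure.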
Reed-Solomon layer over IPFS · Issue #196 · ipfs/notes · GitHub
This old issue also mentions this optimization, but the priority of the idea may be very low.
I’m wondering how much time it would take if I did it myself. Is it possible to implement Reed-Solomon in Kubo in some way?
P.S. I’m a senior student, I have implemented Raft in Go, and I have basic knowledge of distributed systems and storage, but I’m a newbie to IPFS.
I don’t think this project makes a lot of sense directly in Kubo.
IPFS is not that great as an everything file storage solution.
The main point is content addressing (using hashes as addresses); this allows us to decouple content from data transfer and storage.
Underneath this, clients don’t need to care how their files are stored and transferred, because they rely on and verify the hashes; where the bytes came from simply doesn’t matter.
This allows other layers to provide their own solutions which support content addressing without having to update the trust model.
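A minimal sketch of that trust model, under simplifying assumptions: the address here is a plain SHA-256 hex digest rather than a real CID, and `sources` is a hypothetical list of callables standing in for whatever storage or transfer layer serves the bytes.

```python
import hashlib
from typing import Callable, Iterable, Optional

Source = Callable[[str], Optional[bytes]]


def address_of(data: bytes) -> str:
    """Simplified content address: SHA-256 hex digest (not a real CID)."""
    return hashlib.sha256(data).hexdigest()


def fetch_verified(address: str, sources: Iterable[Source]) -> bytes:
    """Return the first response whose hash matches the requested address."""
    for source in sources:
        data = source(address)
        if data is not None and address_of(data) == address:
            return data  # verified, so it does not matter who served it
    raise LookupError("no source returned bytes matching " + address)


# Usage: one source returns wrong bytes, another returns the right ones.
content = b"cold data"
wanted = address_of(content)
bad_source = lambda addr: b"tampered"
erasure_coded_backend = lambda addr: content  # could reconstruct from shards
result = fetch_verified(wanted, [bad_source, erasure_coded_backend])
```

Because the check happens on the client side, a backend that rebuilds blocks from erasure-coded shards looks the same to the client as one that stores them whole.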
TL;DR: if I were you, I would work on a new daemon that provides data over an IPIP-402 trustless gateway and Bitswap, so Kubo and Helia can download data from you, and then you can build your erasure coding network between your own nodes.
I see. Thanks for your kindness and useful solution. I will take it into consideration!
Hey @loomt,
The IPFS storage unit is the block, which is stored in a key-value store keyed by CID. IIRC, erasure coding in the context of IPFS could split the block into multiple parts to obtain redundancy, i.e. we would have the equivalent of storing the block replicated 5x, but only need 4x the amount of space (or similar, correct me if wrong).
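To make the idea of splitting a block concrete, here is a small sketch that uses single XOR parity as a deliberately simpler stand-in for Reed-Solomon. It tolerates the loss of only one shard, whereas Reed-Solomon with `m` parity shards tolerates `m` losses. The function names and the zero-padding scheme are assumptions made for illustration.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(block: bytes, k: int) -> List[bytes]:
    """Split a block into k zero-padded data shards plus one XOR parity shard."""
    shard_len = -(-len(block) // k)  # ceiling division
    padded = block.ljust(shard_len * k, b"\x00")
    shards = [padded[i * shard_len:(i + 1) * shard_len] for i in range(k)]
    return shards + [reduce(xor_bytes, shards)]


def decode(shards: List[Optional[bytes]], original_len: int) -> bytes:
    """Rebuild the block when at most one shard (data or parity) is missing."""
    missing = [i for i, s in enumerate(shards) if s is None]
    if len(missing) > 1:
        raise ValueError("single parity can only repair one missing shard")
    repaired = list(shards)
    if missing:
        present = [s for s in shards if s is not None]
        repaired[missing[0]] = reduce(xor_bytes, present)
    # Drop the parity shard, join the data shards, and strip the padding.
    return b"".join(repaired[:-1])[:original_len]


block = b"hello ipfs block"          # 16 bytes
shards = encode(block, k=4)          # 4 data shards + 1 parity, 4 bytes each
shards[2] = None                     # simulate losing one shard
recovered = decode(shards, len(block))
```

The five shards here take 1.25x the space of the original block, and any four of them are enough to rebuild it.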
Within a single machine, replication would normally be done at the filesystem layer (e.g. RAID). You could think of introducing replication at the ipfs-datastore layer, but you would need to make that datastore aware of the different disks or block storage available to the machine, as otherwise it makes little sense. I suppose software Reed-Solomon implementations for local storage already exist, but I don’t know how practical they are outside cold storage (RAID 6 uses it for parity blocks, I think). But if such things already exist, there’s no point in building it for IPFS blocks and having the IPFS layer worry about something that can be handled better at the filesystem layer.
Another option is to do it at a layer above IPFS, i.e. what you linked. Having IPFS Cluster implement erasure coding is an idea that has been around for a long time. Apart from other issues, the practical problem is that if you split a block into multiple parts and put the parts on multiple machines, each machine will have to reconstruct the block every time it needs to provide it, fetching the parts from other machines. This is not very practical and requires accounting etc. in a distributed system (Cluster), which is much more painful than on a local machine. That makes Cluster unsuitable for erasure coding right now.
Nevertheless, I think it is interesting to explore how erasure coding and IPFS could relate to each other. I.e. given an IPLD DAG, what transformation is necessary to obtain a new DAG (or DAGs) which represents the same information about the original DAG but Reed-Solomon encoded, and what would be the process to extract a block present in the original DAG from the Reed-Solomon one(s), given its CID? How would we carry the index etc… does this make sense at all?
If these questions are answered, then building a system that integrates Reed-Solomon into IPFS in some way would be much easier, as IPFS already knows how to deal with DAGs, traverse them, and move them around.
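One possible shape for such an index is sketched below; it is an illustration, not a design anyone in this thread has proposed. It assumes the blocks of the original DAG are available as a flat list, uses SHA-256 hex digests in place of CIDs, and again uses single XOR parity per stripe instead of real Reed-Solomon. In an actual system the stripes and the index would themselves have to be encoded as IPLD nodes.

```python
import hashlib
from functools import reduce


def digest(data: bytes) -> str:
    """Stand-in for a CID: SHA-256 hex digest of the block bytes."""
    return hashlib.sha256(data).hexdigest()


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def build_stripes(blocks, k):
    """Group blocks into stripes of up to k, add XOR parity, and index by hash."""
    stripes, index = [], {}
    for start in range(0, len(blocks), k):
        group = blocks[start:start + k]
        width = max(len(b) for b in group)
        padded = [b.ljust(width, b"\x00") for b in group]
        for pos, b in enumerate(group):
            # index entry: which stripe, which slot, and the unpadded length
            index[digest(b)] = (len(stripes), pos, len(b))
        stripes.append(padded + [reduce(xor_bytes, padded)])
    return stripes, index


def get_block(address, stripes, index):
    """Fetch an original block by hash, repairing it from its stripe if lost."""
    stripe_no, pos, length = index[address]
    stripe = stripes[stripe_no]
    shard = stripe[pos]
    if shard is None:
        others = [s for i, s in enumerate(stripe) if i != pos]
        if any(s is None for s in others):
            raise ValueError("too many shards lost in this stripe")
        shard = reduce(xor_bytes, others)
    block = shard[:length]
    if digest(block) != address:
        raise ValueError("reconstructed bytes do not match the address")
    return block


blocks = [b"alpha", b"bravo!!", b"c", b"delta"]
stripes, index = build_stripes(blocks, k=2)
stripes[0][1] = None  # simulate losing the shard that held b"bravo!!"
restored = get_block(digest(b"bravo!!"), stripes, index)
```

The final hash check is what keeps the trust model unchanged: whatever the encoded layout looks like, a block is only accepted if it matches the address it was requested by.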