Hello everyone!
I really like the ideas behind IPFS and I want to share some feedback about its design.
The core of the problem is that the CID concept is flawed in its current implementation. I know it sounds blunt and harsh, so let me clarify:
IPFS at its core claims to be a content-addressable file system.
Content address
A content's address is determined by its digest. Simply put, the stream of bytes that represents a file on disk, fed into a hashing function, gives us the digest.
content
== a stream of bytes that represents a file on the disk
This is important, because only the end result (a file on the disk) matters to the end user.
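To make this concrete, here is a minimal sketch of a content address in this sense (plain Python, standard hashlib; cat.jpg stands in for any file):

```python
import hashlib

# Content address in the strict sense: hash the raw file bytes directly.
with open("cat.jpg", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()

print(digest)  # same bytes on disk -> same digest, however they are later chunked
```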
CID != content address
But if we follow the documentation or any blog post about IPFS, we learn that IPFS actually does not use the file's digest as its address. There is a thing called a CID, and it mixes a bunch of concepts together; some of them are correct, and some of them are wrong.
And users are supposed to fetch content based on the CID.
CID is
- multibase-prefix
- multicodec-cidv1
- multicodec-content-type
- multihash-content-address
…
So the same file can have x1 * x2 * x3 * ... * xN CIDs, where N is the number of elements that form a CID and each xi is the number of possible values for element i.
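For reference, here is a minimal sketch of how a CIDv1 string is assembled from those elements. The codes come from the multicodec table (0x55 = raw, 0x70 = dag-pb, 0x12 = sha2-256), and all varints here happen to fit in a single byte:

```python
import base64
import hashlib

def cidv1(data: bytes, codec: int = 0x55) -> str:
    """Assemble a CIDv1: multibase prefix 'b' (base32, lowercase, no padding)
    over varint(version=1) + varint(content-type codec) + multihash."""
    multihash = bytes([0x12, 0x20]) + hashlib.sha256(data).digest()  # sha2-256, 32-byte digest
    return "b" + base64.b32encode(bytes([0x01, codec]) + multihash).decode().lower().rstrip("=")

print(cidv1(b"hello"))              # raw codec -> bafkrei...
print(cidv1(b"hello", codec=0x70))  # same digest, different content-type -> bafybei...
                                    # (illustrative only: a real dag-pb CID hashes an encoded DAG node)
```

Changing any element (the multibase, the content-type codec, the hash function) changes the string, even when the underlying digest is identical.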
Normally this shouldn’t be a problem, but there is no normalization between the different encoding schemes because … (see below)
Multihash != content address!
IPFS stores the content by the multihash, which is not the content's address either.
The multihash is a hash of the DAG (its root block) that this particular file was sliced into. Not only can you use different hashing functions, you can also chunk the file differently, and that produces different hashes.
```
ipfs add --cid-version 1 --chunker=size-1 cat.jpg
bafybeigmitjgwhpx2vgrzp7knbqdu2ju5ytyibfybll7tfb7eqjqujtd3y cat.jpg

ipfs add --cid-version 1 --chunker=size-2 cat.jpg
bafkreicdkwsgwgotjdoc6v6ai34o6y6ukohlxe3aadz4t3uvjitumdoymu cat.jpg
```
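The mechanism can be shown with a toy model. This is not the real dag-pb encoding, just the idea: leaf hashes feed a root hash, so changing the chunker changes the root even for identical bytes:

```python
import hashlib

def toy_dag_root(data: bytes, chunk_size: int) -> str:
    """Toy model (not real dag-pb): hash each fixed-size chunk,
    then hash the concatenated chunk digests to get a root id."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    return hashlib.sha256(b"".join(hashlib.sha256(c).digest() for c in chunks)).hexdigest()

data = b"the same cat picture bytes"
print(toy_dag_root(data, 1))  # one root
print(toy_dag_root(data, 2))  # a different root for identical content
```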
So the same content can be represented by M * N DAGs, where M is the number of hashing functions available and N is the number of chunking options available.
So DAG encoding != content’s address
The problems
- IPFS claims to have automatic content de-duplication, but it only works inside one hashing-chunking scheme. It is in fact possible to have M * N duplicates of the same file stored alongside each other (see the sketch after this list).
- Converting CIDs is not possible without having the file on disk, so different peers can have the same content under different CIDs, which fragments the network into incompatible segments or forces everyone to keep M * N copies of the content for interoperability.
- A given CID forces a specific file chunk layout onto everyone who downloads the content, which will cause storage and network performance problems (different OSes, different storage devices, different FS cache sizes, different network connectivity, not to mention duplication of OS kernel functionality in user space).
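To illustrate the de-duplication point with the same toy DAG model and a hypothetical block store:

```python
import hashlib

store: dict[str, bytes] = {}  # hypothetical store keyed by DAG root

def toy_add(data: bytes, chunk_size: int) -> str:
    """Toy 'ipfs add': the key depends on the chunking, so the same
    bytes added with two chunkers occupy two separate entries."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    root = hashlib.sha256(b"".join(hashlib.sha256(c).digest() for c in chunks)).hexdigest()
    store[root] = data
    return root

cat = b"the same cat picture bytes"
toy_add(cat, 1)
toy_add(cat, 2)
print(len(store))  # 2 -- two full copies of identical content, no de-duplication
```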
Proposed solution
Do not address by DAG encoding/layout; address by the file's content.
The Content ID (CID) should carry the content hash instead of the DAG hash. With this it becomes possible to normalize CIDs: once a peer gets a file, it can calculate multiple hashes of the content and advertise them all (see the sketch at the end of this post).
A particular DAG layout can be requested by the peer at transfer time.
It is also up to the peer how to store the content (in chunks or as a single file; let the OS handle it or deal with it in user space…).
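To sketch what this could look like (advertise_ids is a hypothetical helper; the set of hash functions is an assumption, whatever the peer supports):

```python
import hashlib

SUPPORTED = ("sha256", "sha512", "blake2b")  # assumption: functions this peer supports

def advertise_ids(path: str) -> dict:
    """Sketch of the proposal: hash the file bytes themselves (not a DAG),
    once per supported function, and advertise every resulting id."""
    with open(path, "rb") as f:
        data = f.read()
    return {name: hashlib.new(name, data).hexdigest() for name in SUPPORTED}

print(advertise_ids("cat.jpg"))  # one stored file, findable under any supported hash
```

A peer that stores the file once can then answer lookups for any of the advertised names without keeping extra copies.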