These two different addresses point to completely identical files: one uploaded with Storacha and the other with Pinata. How is this possible? Is it a bug?
TLDR: IPFS supports content addressing, and there are multiple ways to hash a file into an address. So while a single CID will only ever verify one specific set of bytes, the same set of bytes can be represented by many different CIDs.
Some examples:
You could use a different hash function (e.g. SHA2-256, SHA2-512, SHA3-256, Blake3, etc.)
You could use the UnixFS specification to encode your files/folders instead of just hashing the file bytes (within the IPFS ecosystem this is predominantly how files are handled, unless they're small, e.g. under 2MiB, in which case a tool might just hash the file bytes).
While using UnixFS you could choose any number of possible ways to ingest your file
You could use fixed size chunks that are smaller (e.g. 256KiB) or larger (e.g. 1MiB), you could have content-based chunkers like those based on Rabin fingerprints, have larger/smaller fanouts/depth of the tree for larger files, etc.
People even come up with interesting content/application-specific chunking schemes like IPFS Custom File Chunking for WARC and WACZ, all of which are compatible with and readable by IPFS applications that implement UnixFS.
You can use https://dag.ipfs.tech to visualize a few of the possible configuration options
Note: Some people are interested in enumerating the most common "profiles" (ways of encoding files/folders into UnixFS); see the post "Should we profile CIDs?".
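To make the hash-function point concrete, here is a minimal sketch that builds CIDv1 strings for the same bytes using three different hash functions. It assumes the simplest case (the `raw` codec, i.e. hashing the file bytes directly, with no UnixFS wrapping) and base32 text encoding; the helper name `raw_cid_v1` and the sample bytes are made up for illustration.

```python
import base64
import hashlib

RAW_CODEC = 0x55  # multicodec code for "raw" (plain bytes, no UnixFS wrapping)

# multihash code and hashlib constructor for a few hash functions
HASHES = {
    "sha2-256": (0x12, hashlib.sha256),
    "sha2-512": (0x13, hashlib.sha512),
    "sha3-256": (0x16, hashlib.sha3_256),
}


def raw_cid_v1(data: bytes, hash_name: str) -> str:
    """Hypothetical helper: CIDv1 string (base32) for raw bytes."""
    code, hash_fn = HASHES[hash_name]
    digest = hash_fn(data).digest()
    # Every value used here is below 0x80, so each varint fits in one byte.
    multihash = bytes([code, len(digest)]) + digest
    cid_bytes = bytes([0x01, RAW_CODEC]) + multihash  # version, codec, multihash
    b32 = base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")
    return "b" + b32  # "b" is the multibase prefix for lowercase base32


data = b"the very same file bytes"  # illustrative content
cids = {name: raw_cid_v1(data, name) for name in HASHES}
for name, cid in cids.items():
    print(f"{name}: {cid}")
```

All three strings are valid addresses for the same bytes, yet none of them match, and this is before chunking or UnixFS encoding enters the picture.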
The services mentioned (Storacha and Pinata) use different UnixFS import parameters when onboarding data.
There is an effort to create vendor-agnostic standards/conventions:
Until that exists, you can't rely on two services producing the same DAG. If you need that guarantee, create the DAG yourself and then import it (e.g. as a CAR) into a third-party service, which preserves your DAG shape and CIDs.
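As a rough illustration of why import parameters matter, the sketch below hashes the same bytes into a toy two-level Merkle tree with two different fixed chunk sizes. This is deliberately not real UnixFS: there is no dag-pb encoding and no CIDs, and `chunk` and `toy_root` are hypothetical helpers. The only point is that the root hash depends on how the file was split.

```python
import hashlib


def chunk(data: bytes, size: int) -> list[bytes]:
    """Split data into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def toy_root(data: bytes, chunk_size: int) -> str:
    """Toy Merkle root: hash each chunk, then hash the concatenated leaf hashes.

    Simplified stand-in for a UnixFS DAG, NOT the real dag-pb format.
    """
    leaves = [hashlib.sha256(c).digest() for c in chunk(data, chunk_size)]
    if len(leaves) == 1:
        return leaves[0].hex()  # single-chunk file: the leaf is the root
    return hashlib.sha256(b"".join(leaves)).hexdigest()


# Deterministic sample "file" of roughly 2.3 MiB
data = b"".join(i.to_bytes(4, "big") for i in range(600_000))

root_256k = toy_root(data, 256 * 1024)   # smaller chunks, more leaves
root_1m = toy_root(data, 1024 * 1024)    # larger chunks, fewer leaves

print("256KiB chunks:", root_256k)
print("1MiB chunks:  ", root_1m)
```

Both trees reassemble to exactly the same file, but their roots differ, which is the same effect seen when two services import one file with different settings.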
You can already pack the file in different ways, or I believe just rename it, and get a different CID. But different CIDs for identical files? Wasn't data consistency the initial idea of IPFS as a whole? I'm a bit disappointed.
I feel you have an incorrect understanding/expectation of what UnixFS, a subset of IPFS, does. Adin linked you some resources; there is also a good video on content addressing and chunking in Content Addressing | Protocol Labs Research, especially the part on chunking that starts at the 08:25 mark.
If useful, a very (over)simplified TLDR for UnixFS DAGs:
A UnixFS CID is not a unique identifier of a file; it is a unique identifier of a DAG representation of that file.
A small file could be a DAG with only one node (no chunking), with or without dag-pb wrapping, but it is still a DAG abstraction at the end of the day.
Every file can be chunked, and chunking allows a practically infinite number of possible DAGs.
You get data consistency and integrity checks for a specific DAG. This is a different abstraction than a file.
In other words, once you create a DAG, IPFS provides cryptographic guarantees of consistency/integrity for that DAG. The person who onboards the data and creates the CID is in control. Others are read-only.
You have no control over other people: they may have different needs and choose to build different DAGs that are more performant for their use cases (transfer speed, deduplication, etc.).
Folks who need a shared convention for creating UnixFS DAGs, so that the same file/directory always produces the same DAG across implementations and services, should engage in the standardization process in IPIP-499.
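The integrity guarantee itself can be sketched in a few lines: whoever receives a block re-hashes it and compares the result with the digest embedded in the CID. The example assumes the simplest possible DAG, a single raw block addressed by a base32 CIDv1 with SHA2-256; `make_cid` and `verify_block` are illustrative helpers, not part of any IPFS library.

```python
import base64
import hashlib


def make_cid(block: bytes) -> str:
    """Hypothetical helper: CIDv1 (raw codec, sha2-256, base32) for one block."""
    digest = hashlib.sha256(block).digest()
    cid_bytes = bytes([0x01, 0x55, 0x12, len(digest)]) + digest
    return "b" + base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")


def verify_block(cid: str, block: bytes) -> bool:
    """Re-hash the block and compare it with the digest inside the CID."""
    if not cid.startswith("b"):
        raise ValueError("this sketch only handles base32 (multibase 'b') CIDs")
    body = cid[1:].upper()
    body += "=" * (-len(body) % 8)  # restore the padding base32 decoding needs
    raw = base64.b32decode(body)
    version, _codec, hash_code, length = raw[:4]
    digest = raw[4:]
    if version != 1 or hash_code != 0x12 or length != len(digest):
        raise ValueError("this sketch only handles CIDv1 with sha2-256")
    return hashlib.sha256(block).digest() == digest


cid = make_cid(b"published by the person who onboarded the data")
print(verify_block(cid, b"published by the person who onboarded the data"))
print(verify_block(cid, b"tampered bytes"))
```

The check says nothing about whether another CID could describe the same file; it only confirms that these bytes are the ones the CID's creator committed to.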