Two IPFS addresses for one file - how did this happen?

bafybeifzzhive2xstwah73sgoxeaeypgqmc4ystobooeapnyr5itbbqzma
bafybeibthjnw3icvz2rbgq3hajsleyql6to42perjeoabtrolvw2te3ev4

These two different addresses point to completely identical files: one was uploaded with Storacha and the other with Pinata. How is this possible? Is it a bug?

A byte-by-byte comparison with cmp on Linux says the files are identical, yet there are now two addresses for them.

TLDR: IPFS supports content addressing, and there are multiple ways to hash a file into an address. So while a single CID will only ever verify one specific set of bytes, each set of bytes can be represented by multiple different CIDs.

Some examples:

  • You could use a different hash function (e.g. SHA2-256, SHA2-512, SHA3-256, Blake3, etc.)
  • You could use the UnixFS specification to encode your files / folders instead of just hashing the file bytes (within the IPFS ecosystem this is predominantly how files are worked with unless they’re small, e.g. under 2MiB, in which case they might just be hashing the file bytes)
  • While using UnixFS you could choose any number of possible ways to ingest your file
    • You could use fixed size chunks that are smaller (e.g. 256KiB) or larger (e.g. 1MiB), you could have content-based chunkers like those based on Rabin fingerprints, have larger/smaller fanouts/depth of the tree for larger files, etc.
    • People even come up with interesting content/application specific chunking schemes like IPFS Custom File Chunking for WARC and WACZ, all of which are compatible and readable by IPFS applications that implement UnixFS.
    • You can use https://dag.ipfs.tech to visualize a few of the possible configuration options
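To make the first bullet concrete, here is a minimal sketch that builds a CIDv1 by hand for the same bytes using two different hash functions. It assumes the simplest possible case: no UnixFS, no chunking, just the raw codec over the file bytes. The function name `raw_cid_v1` and the sample data are made up for illustration. The CIDs in the question start with `bafybei`, which indicates dag-pb rather than raw, so this will not reproduce either of them.

```python
import base64
import hashlib

RAW_CODEC = 0x55  # multicodec code for "raw" bytes

# Multihash codes mapped to the matching hashlib algorithm names.
MULTIHASH = {
    "sha2-256": (0x12, "sha256"),
    "sha2-512": (0x13, "sha512"),
}


def raw_cid_v1(data: bytes, hash_name: str = "sha2-256") -> str:
    """Build a CIDv1 string (base32, raw codec) for the given bytes."""
    code, algo = MULTIHASH[hash_name]
    digest = hashlib.new(algo, data).digest()
    # <version><codec><hash code><digest length><digest>
    # Every value used here is below 128, so each varint fits in one byte.
    cid_bytes = bytes([0x01, RAW_CODEC, code, len(digest)]) + digest
    b32 = base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")
    return "b" + b32  # "b" is the multibase prefix for lower-case base32


data = b"hello IPFS"  # hypothetical file content
cid_256 = raw_cid_v1(data, "sha2-256")
cid_512 = raw_cid_v1(data, "sha2-512")
print(cid_256)
print(cid_512)
```

Both strings are valid addresses for the same ten bytes, and each one verifies only those bytes. Changing the codec or the chunking multiplies the possibilities further.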

Note: There are some people interested in enumerating some of the most common “profiles” / ways of encoding files/folders into UnixFS in this post: Should we profile CIDs?

You can also inspect how the file was chunked in each case:

The services mentioned (Storacha and Pinata) use different UnixFS import parameters when onboarding data.

There is an effort to create vendor-agnostic standards/conventions:

Until that exists, you can’t trust two services to produce the same DAG. If you need that guarantee, you need to create the DAG yourself and then import it (e.g. as a CAR) into a third-party service, so that your DAG shape and CIDs are preserved.

I support! I support it very much.

You can already pack the file in different ways, or I believe just rename it, and get a different CID. But different CIDs for identical files? Wasn’t data consistency the initial idea of IPFS as a whole? I’m a bit disappointed.

I feel you have an incorrect understanding/expectation of what UnixFS, a subset of IPFS, does. Adin linked you some resources; there is also a good video on content addressing and chunking in Content Addressing | Protocol Labs Research, especially the part on chunking that starts at the 08:25 mark.

If useful, a very (over)simplified TLDR for UnixFS DAGs:

  • A UnixFS CID is not a unique identifier of a file; it is a unique identifier of a DAG representation of that file.
  • A small file could be a DAG with only one node (no chunking), or a dag-pb node wrapping the chunk(s), but it is still a DAG abstraction at the end of the day.
  • Every file can be chunked, and because chunking can be done in so many ways, there is an effectively unlimited number of possible DAGs for it.
  • You get data consistency and an integrity check for a specific DAG. This is a different abstraction than a file.
  • In other words, once you create a DAG, IPFS provides cryptographic guarantees of consistency/integrity for that DAG. The person who onboards the data and creates the CID is in control. Others are read-only.
  • You have no control over other people: they may have different needs and choose to build different DAGs because those are more performant for their use cases (transfer speed, deduplication, etc.).
  • Folks who need a shared convention for creating UnixFS DAGs, so that the same file/directory always produces the same DAG across implementations and services, should engage in the standardization process in IPIP-499.
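The points above can be illustrated with a toy Merkle DAG. This is a sketch under loud assumptions: it is not UnixFS or dag-pb, the root is just a SHA2-256 over the concatenated chunk hashes, and `toy_root` and `split` are hypothetical helper names. It only shows why the same bytes yield different roots once the chunk size changes.

```python
import hashlib


def split(data: bytes, size: int) -> list[bytes]:
    """Fixed-size chunker: cut the data into pieces of at most `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def toy_root(data: bytes, chunk_size: int) -> str:
    """Hash each chunk, then hash the list of chunk hashes (NOT real dag-pb)."""
    leaves = [hashlib.sha256(c).digest() for c in split(data, chunk_size)]
    if len(leaves) == 1:
        # A file that fits in one chunk is a single-node DAG.
        return leaves[0].hex()
    return hashlib.sha256(b"".join(leaves)).hexdigest()


# Hypothetical 2 MiB file with deterministic, non-repeating-per-chunk content.
data = bytes(i % 251 for i in range(2 * 1024 * 1024))

root_256k = toy_root(data, 256 * 1024)   # 8 leaves
root_1m = toy_root(data, 1024 * 1024)    # 2 leaves

# Reassembling either chunking gives back exactly the same file bytes.
same_bytes = (
    b"".join(split(data, 256 * 1024)) == b"".join(split(data, 1024 * 1024)) == data
)

print(root_256k)
print(root_1m)
print(same_bytes)
```

Each root is stable for its own parameters, which is the integrity guarantee for that DAG. The two roots differ even though the file is byte-for-byte identical, which mirrors what happened with the two services.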