Should we profile IPLD?

robin · September 23, 2024, 2:11pm

Building on the productive conversation we had about CIDs, I’m interested in hearing from the community about IPLD as well.

IPLD may be less foundational than CIDs, but it is still very close to the beating heart of IPFS. There too, I frequently hear from people complaining. Common issues I hear are:

The tooling isn’t there, people don’t know what to do with IPLD when they get it.
It’s too complex, the codecs make things less predictable for developers handling data. More rarely, missing codecs.
Not worth it compared to deterministic CBOR or some other alternative.

This leads me to believe that there is potential to help people and increase adoption with some relatively limited changes that could well be backwards compatible (i.e. readable by previous processors). I would be curious to hear what people here think, however.

What are your IPLD pain points, if any?

lidel · September 25, 2024, 9:22pm

One of the profiles could be “real world IPLD” as we see it today in IPFS Mainnet / Amino DHT. Those are interoperable, traversable DAGs, mostly defined with dag-cbor and/or unixfs (dag-pb+raw).

Relatively, that is a very small set of codecs, but we could go even simpler. We have CBOR Tag 42 registered at IANA as a way of linking to a CID from CBOR documents. It is probably the smallest building block we could use when reasoning about minimal “(no-)IPLD profiles”.
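To make the Tag 42 building block concrete, here is a minimal sketch that hand-encodes a CBOR link to a CID with nothing but the standard library. The helper names `raw_cid` and `tag42_link` are made up for illustration. The layout assumed here is the DAG-CBOR convention: tag 42 wrapping a byte string that holds a 0x00 (identity multibase) prefix followed by the binary CID.

```python
import hashlib

def raw_cid(data: bytes) -> bytes:
    """Binary CIDv1: version 1, raw codec (0x55), sha2-256 multihash (0x12, 32 bytes)."""
    return bytes([0x01, 0x55, 0x12, 0x20]) + hashlib.sha256(data).digest()

def tag42_link(cid: bytes) -> bytes:
    """CBOR tag 42 (0xd8 0x2a) around a byte string holding 0x00 + CID."""
    payload = b"\x00" + cid
    if not 24 <= len(payload) <= 255:
        # Assumption: only the one-byte-length form (0x58) is handled in this sketch.
        raise ValueError("sketch only handles payloads of 24..255 bytes")
    return b"\xd8\x2a" + b"\x58" + bytes([len(payload)]) + payload

link = tag42_link(raw_cid(b"hello world"))
print(link.hex())
```

That is the whole linking primitive: a tag, a length, and the CID bytes.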

My understanding is that the major pain point in the ecosystem is the lack of tooling to do basic operations against these DAGs. Namely, we lack tooling for two operations:

(“diff”) for COMPARING two DAGs
(“patch”) for MODIFYING existing DAG without fetching the entire thing, ability to deterministically “patch” new data into existing [potentially huge] DAG/directory
- using a STABLE syntax (e.g. JSON-based, RFC 6902-inspired, as in the IPLD Patch specs)
- without having to learn IPLD/codec internals.
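As a rough illustration of why “diff” is cheap on content-addressed DAGs, here is a sketch under loud assumptions: the block store is a plain dict, nodes are dicts with `data` and `links` fields, and the “CID” is just a SHA-256 over sorted-key JSON. None of that is real IPLD; `put` and `diff` are hypothetical helpers. The operation names are borrowed from RFC 6902.

```python
import hashlib
import json

def put(store, node):
    """Store a node under the hash of its canonical JSON (a stand-in for a CID)."""
    key = hashlib.sha256(json.dumps(node, sort_keys=True).encode()).hexdigest()
    store[key] = node
    return key

def diff(store, a, b, path="/"):
    """Return (op, path) pairs describing how the DAG at b differs from the one at a."""
    if a == b:
        return []  # same hash means same subtree, so nothing below needs fetching
    if a is None:
        return [("add", path)]
    if b is None:
        return [("remove", path)]
    node_a, node_b = store[a], store[b]
    changes = []
    if node_a["data"] != node_b["data"]:
        changes.append(("replace", path))
    for name in sorted(set(node_a["links"]) | set(node_b["links"])):
        child = path.rstrip("/") + "/" + name
        changes += diff(store, node_a["links"].get(name), node_b["links"].get(name), child)
    return changes

store = {}
hello = put(store, {"data": "hello", "links": {}})
world = put(store, {"data": "world", "links": {}})
world2 = put(store, {"data": "world!", "links": {}})
old_root = put(store, {"data": "", "links": {"a.txt": hello, "b.txt": world}})
new_root = put(store, {"data": "", "links": {"a.txt": hello, "b.txt": world2, "c.txt": hello}})
changes = diff(store, old_root, new_root)
print(changes)
```

The unchanged `a.txt` subtree is never opened, which is the property a real diff tool would rely on for huge DAGs.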

There is more context and prior discussion about both in Kubo tickets (the ipfs dag API aims to be the interface for working with IPLD DAGs, however it is very limited atm):

hector · September 25, 2024, 10:31pm

I will take the opportunity to propose two new IPLD formats: “simple-file” and “simple-dir”. They follow the unix philosophy that text is the best format. I would like to spec them but until then the main features are:

Similar to dag-pb in that it represents files (made of chunks) and folders but made easy so that you can work on them without “ipld”-specific tooling:
Encoding is text-based, probably line-delimited entries for each chunk of a file or each folder entry. At worst json, but something the user can read without tools, and definitely not binary.
Opinionated chunk size set to 1MB.
Last entry in a root node can be a link to more of the same (i.e. for large files/dirs), similar to trickle layout or ext filesystem inodes.

Diffing is diffing text files. Patching would be usually appending. Running on small hardware or niche platforms does not require protobuf or cbor support and blocks can be worked by hand.
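Since the formats are not specced yet, the following is only a guess at what a “simple-file” could look like. Everything concrete here is hypothetical: the line syntax (`chunk <sha256-hex> <size>` and a trailing `next <sha256-hex>`), the per-node entry limit, the dict used as a block store, and the use of plain SHA-256 hex instead of CIDs. Only the 1MB chunk size and the “last entry links to more of the same” idea come from the description above.

```python
import hashlib

CHUNK_SIZE = 1024 * 1024  # "1MB" per the proposal; reading it as 1 MiB is an assumption
MAX_ENTRIES = 4           # hypothetical per-node limit, kept tiny to show continuation

def put(store, data):
    """Store bytes under their SHA-256 hex digest (a stand-in for a CID)."""
    key = hashlib.sha256(data).hexdigest()
    store[key] = data
    return key

def write_simple_file(store, data, chunk_size=CHUNK_SIZE, max_entries=MAX_ENTRIES):
    """Chunk data and return the key of the root text node."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    lines = [f"chunk {put(store, c)} {len(c)}" for c in chunks]
    starts = list(range(0, len(lines), max_entries)) or [0]
    next_key = None
    # Build nodes from the tail so each node can end with a link to the next one.
    for start in reversed(starts):
        node_lines = lines[start:start + max_entries]
        if next_key is not None:
            node_lines.append(f"next {next_key}")
        text = "".join(line + "\n" for line in node_lines)
        next_key = put(store, text.encode())
    return next_key

def read_simple_file(store, key):
    """Follow chunk and next entries to reassemble the file."""
    out = bytearray()
    while key is not None:
        text, key = store[key].decode(), None
        for line in text.splitlines():
            kind, ref = line.split()[:2]
            if kind == "chunk":
                out += store[ref]
            elif kind == "next":
                key = ref
    return bytes(out)
```

A root node is then a handful of readable lines, and appending to a file only rewrites the nodes on the path to the tail.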

It obviously has shortcomings wrt dag-pb in many aspects. But many use-cases of ipfs (i.e. 99% of websites etc.) have no use for file perms, hamts, quick lookups, etc. Working with dag-pb protobufs is a complexity which forces people to use existing tooling and learn it. It is not easy to just script something in your favorite language.

I think IPLD can be made simple to understand and work with for our main thing, which is “files”, but we force people to deal with Protobufs or otherwise cbor, and mostly they have to rely on our tooling for it. I want Arduinos and ESP32s to speak IPFS without needing to rely on someone maintaining some C-library that does heavy-lifting. It would also be good to demystify IPLD by having dumb codecs for dumb things (and a small file or a small folder are dumb things).

mosh · October 1, 2024, 1:26pm

Last week ATProto merged this PR de-emphasizing IPLD in the specs. The 2 reasons cited:

IPLD can be confusing for devs digging into ATProto for the first time
Lack of acceptance by multi-stakeholder governing body (IETF, W3C, etc.)

The ability to fix #1 is entirely in our community’s control. #2 is less straightforward.

github.com/bluesky-social/atproto-website

specs: remove many references to IPLD

bluesky-social:main ← bluesky-social:bnewbold/ipld-erasure

opened 09:31PM - 26 Sep 24 UTC

bnewbold

+24 -18

This strongly de-emphasizes IPLD in the specs. It isn't entirely removed: the data model page now has an explicit section discussion the history and connection between atproto data and IPLD. The motivation here is two-fold: IPLD can be confusing to devs digging in to atproto for the first time, and there is some potential standards body baggage/history. In terms of dev understanding: IPLD is conceptually complex, and has a bunch of features/ecosystem we don't use. atproto early adopter devs may have been already familiar with IPLD so the reference was an asset, but now it is probably more confusing. Keeping some reference is helpful in case folks want to re-use IPLD implementation libraries or storage projects. In terms of standards: we want to pare down our standards "dependency tree" as much as possible. IPLD has not (yet) been formalized in an established multi-stakeholder body like the W3C, IETF, or ISO, and that can be friction when taking standards to such bodies. For our use of DAG-CBOR, we can reference the relevant normalization rules discussed in the CBOR RFCs themselves.

JohnT · October 4, 2024, 4:05pm

“Last entry in a root node can be a link to more of the same”

The current large files in UnixFS, I believe, leave it up to the chunker to decide which links go to another stem vs. a leaf. So this kind of use case is available if you want a flatter DAG like that. My thinking is more about the directories… what I would’ve loved is to have an optional flag (or logical equivalent) on a dirent that means something like:
(so you have for each entry: name, link, flag)

false/missing - link points to the root of a file/directory named name, as usual
true - link points to a node of the same type as this one, which is a continuation of the current directory containing all items whose name is >= name and < the name of the next entry (if this isn’t the last one)

This would be a more general form of what you’re describing, while still allowing fast(ish) lookups into giant directories like /wiki/ without requiring very peculiar hashing and bit twiddling algorithms like the HAMT sharded dir does.
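A sketch of how a lookup could work with such a flag, under stated assumptions: each directory node is a name-sorted list of `(name, link, flag)` tuples, names within a node are unique, and the block store is a plain dict keyed by made-up strings. `lookup` is a hypothetical helper, not anything in UnixFS today.

```python
from bisect import bisect_right

def lookup(store, dir_key, target):
    """Return the link for target, descending into continuation nodes as needed."""
    entries = store[dir_key]
    while True:
        names = [name for name, _, _ in entries]
        # Rightmost entry whose name is <= target.
        i = bisect_right(names, target) - 1
        if i < 0:
            return None  # target sorts before every entry in this node
        name, link, is_continuation = entries[i]
        if is_continuation:
            # This entry covers [name, next entry's name), so the target lives below it.
            entries = store[link]
            continue
        return link if name == target else None

store = {
    "root": [("a", "cid-a", False), ("m", "shard-m", True), ("x", "cid-x", False)],
    "shard-m": [("m", "cid-m", False), ("n", "cid-n", False)],
}
print(lookup(store, "root", "n"))
```

Each step is an ordinary binary search over names, so a lookup in a huge directory touches one node per level without any hashing.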

On a separate note: could you elaborate on your concerns with protobuf? We use tooling for text data, too (editors). And protobuf tooling is readily available for: (from Overview | Protocol Buffers Documentation)

Is there a lot of demand for implementing UnixFS in languages other than that? Or is it some particular headache I’m not seeing? The only headache related to protobuf I’ve run into is being very careful to include headers from the same version as the protoc compiler I’m using (so if I’m building a library outside Chromium, system libraries, if inside Chromium use their headers, and be careful not to mix them). But it wouldn’t be in the top 5 things I complained about.

adin · October 6, 2024, 5:35am

IMO this is not a thing, because it’s not the right framing of the problems in IPLD-land at the moment. I’m going to take a step back to be more meta since IMO it helps us get to what is IMO the problem (and a possible solution) for what could be next for IPLD.

TLDR: If this is as far as you get and you want something actionable I think we need a way of representing IPLD beyond UnixFS in URIs.

What is IPLD anyway? My understanding is that the current setup has it in a similar space to libp2p, multiformats, and IPFS which is basically a core concept + a pile of specs (not talking anything that fancy, but the kind of thing you’d expect multiple implementations to have rather than just details of a specific implementation) that you may choose to adopt from:

multiformats:
- core concept: your format today will not work for everything, so for the low cost of a few bytes let’s describe what we’re talking about and allow for evolution / integration of new formats
- specs: multihash, multibase, multiaddr
libp2p:
- core concept: p2p networks come in many shapes and sizes but frequently solve similar problems, separating out the pieces allows for evolution of networks as well as interoperability between them
- specs: identify, peerIDs, multistream, libp2p’s specified usage of TLS, Noise, Yamux, Mplex, QUIC, WebTransport, …, gossipsub, kad-dht, and many more
IPFS
- core concept: It should be possible to address data by what it is, rather than where it is and since the data is verifiable it can come on a variety of transports from a variety of sources.
- specs: Amino DHT, UnixFS, IPFS HTTP Gateway, IPNI, …
  - Note: what falls into the “specs” bucket of an open project like this isn’t always a clear line. The IPNI specs aren’t listed in https://specs.ipfs.tech, but they seem relevant to the IPFS ecosystem. Similarly, it could go either way if UnixFS is considered an IPFS or IPLD spec.
IPLD
- core concept: content addressable data comes in many shapes and sizes but frequently works in similar ways to solve similar problems, if we can have shared tooling across different content addressable data structures we’ll make it easier to build and evolve systems as well as have interoperability between them
- specs: IPLD selectors, IPLD schemas, and some codecs (dag-pb, dag-json, dag-cbor, dag-jose, git, the eth codecs, …), maybe CAR format

and so IMO the main failing of IPLD today is that we just don’t have enough compelling specs and tools to make the core concept of “reusable content-addressable tooling” really all that impactful.

Selectors: IMO not friendly enough to work with and/or not powerful enough to justify its use, but opinions aside I have not heard of much traction here
Schemas: Seem to have proven useful to folks in a number of scenarios to describe content addressable data even when in practice only a single “codec” is used for the data since people are building unique data structures out of the basic building blocks. In practice working with multiblock data structures (e.g. large maps, files, etc.) here is still painful.
Codecs: There are a number of these and they do perform translations into the IPLD Data Model, but IMO the utility of moving through the data model isn’t currently enough. The main things I see people do with codecs are:
1. Encode their data → Not really more effort than just reusing code for a single data encoding (e.g. dag-cbor)
2. Ask for an entire DAG (i.e. follow all the links) → Good, but people frequently need smaller sub-DAGs
3. Convert the individual blocks or components from a less readable binary format into a more readable format → Nice, although probably not valuable enough on its own. It also has issues:
  - For some data like the Filecoin HAMT, the Solana Yellowstone data, etc. the data is not readable even in this form. You need at least a schema transformation and even that might not really be enough. For example, a HAMT is a multiblock data structure that you might want to see as a map rather than as its internals. Another example is that multiaddresses have a binary and a text format; converting from dag-cbor to dag-json will not handle that conversion, so you’ll either have unreadable base64 in dag-json or verbose text in dag-cbor.
CAR format: Despite its issues CAR(v1) has been adopted by a number of projects and seems to do the basic job of moving around content addressable data independent of the formatting. It’s a bit of an outlier as an IPLD spec in that its job is really at the block / multihash layer (similar to protocols like Bitswap), where the most compatibility lives, rather than anything fancier (e.g. IPLD Data Model). It has probably gained the most adoption of any of these specs.
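To illustrate the multiaddress point from the codecs item above, here is a small sketch of what a generic bytes-to-DAG-JSON conversion produces for a binary multiaddr. The helper names are made up; the assumptions are that `/ip4/.../tcp/...` encodes as protocol code 4 plus four address bytes and code 6 plus a two-byte port, and that DAG-JSON renders bytes as `{"/": {"bytes": ...}}` with unpadded base64.

```python
import base64
import json
import socket
import struct

def multiaddr_bytes(ip, port):
    """Binary /ip4/<ip>/tcp/<port>: code 4 + 4 address bytes, code 6 + big-endian port."""
    return b"\x04" + socket.inet_aton(ip) + b"\x06" + struct.pack(">H", port)

def dag_json_bytes(value):
    """DAG-JSON rendering of a bytes value, using unpadded base64."""
    b64 = base64.b64encode(value).decode().rstrip("=")
    return json.dumps({"/": {"bytes": b64}})

addr = multiaddr_bytes("127.0.0.1", 4001)
print(dag_json_bytes(addr))        # what a codec-level conversion gives you
print("/ip4/127.0.0.1/tcp/4001")   # what a person would want to read
```

The codec faithfully preserves the bytes, but only something that knows the field is a multiaddr (a schema or application code) can produce the readable form.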

Where do we go from here?

IMO IPLD needs at least one of two things to build momentum:

Great tooling for building new content addressable formats (or reusing existing ones like dag-cbor and IPLD HAMTs / AMTs)
A compelling reason for someone with an existing data format to write some glue code that allows them to leverage other existing tooling from the IPLD ecosystem

People seem to have spent a good deal of time thinking about #1 in a way I personally don’t feel has been that successful. It could be valuable, but my opinion is that the second is more valuable and under explored. This is because it allows people to show up to the ecosystem “later” and still get lots of benefits without a huge rewrite and because it’s easier to build a good abstraction when you already have multiple implementations that you’d want to fit the abstraction rather than building the abstraction and hoping the implementations will fit nicely when they’re built in the future.

In many ways this is similar to my post from 2022. If there’s limited value from being able to work with UnixFS, BitTorrent, Git, Ethereum, Filecoin, etc. data via the same tooling are we really equipped to build tooling for the next 10 formats to share?

A concrete proposal could involve creating an IPLD URI scheme that allows interacting sanely with not just UnixFS data, but also others like BitTorrent, Filecoin, Solana Yellowstone, etc. with the next stop being to make IPFS and IPLD implementations that can safely handle large blocks coming from p2p sources (see the linked post for more info).

On a less technical note, IMO the IPLD docs website is really in the weeds and seems to say “welcome to IPLD, here’s a brain dump of everything we’ve discovered in trying to create content addressable data structures for when you decide what you’d like to do for yours”. That’s certainly interesting and content I’ve pointed people at when they were in fact making their own new formats or trying to understand tradeoffs in approaches, but for many people I don’t think that’s what they’d want to see as the front door. In that way having profiles or personas to help guide people to what they actually want to see might be useful.

I recognize this post is crazy long already, but for those interested Juan’s talk at FIL Dev Summit 2023 on the Filecoin Data Layer has a number of interesting / controversial discussions. Here are a couple where I’m also chiming in around people trying to make existing formats work with IPLD/IPFS and what does the spectrum of compatibility with IPLD/IPFS mean for these existing formats

hector · October 6, 2024, 9:36pm

While well supported across languages, it’s binary. You need to compile the protobuf schema to be able to work with them. A user getting a dag-pb block without previous knowledge of what is going on needs to 1) figure out that this is a protobuf, 2) find the protobuf schema, 3) import protobuf libraries into their project, and 4) in most cases for the languages above, implement unix-fs (not an easy task!). And then they can finally put together a cat picture, or an index.html, or a folder with 3 files. That makes no sense to me.

When you work with a REST API, for example, you can craft a request by writing json by hand without any json library support. You can get going because you can read things. I would like readable IPLD, not for the advanced use-case, but for the easy one.

hector · October 6, 2024, 9:50pm

CAR format is used because it is stupid. Anyone can glue together a CAR reader/writer for whatever.
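For a sense of how little glue that is, here is a sketch of a CARv1 writer and reader using only the standard library. Assumptions, loudly: it handles a single root, only CIDv1 with lengths that fit the one-byte forms used here, and the header map is hand-encoded DAG-CBOR rather than produced by a codec. All function names are illustrative.

```python
import hashlib
import io

def varint(n):
    """Unsigned LEB128 varint, as used across multiformats."""
    out = bytearray()
    while True:
        byte, n = n & 0x7F, n >> 7
        out.append(byte | (0x80 if n else 0))
        if not n:
            return bytes(out)

def read_varint(stream):
    """Read one varint from a stream; return None at a clean end of stream."""
    result = shift = 0
    while True:
        raw = stream.read(1)
        if not raw:
            return None
        result |= (raw[0] & 0x7F) << shift
        if not raw[0] & 0x80:
            return result
        shift += 7

def raw_cid(data):
    """Binary CIDv1, raw codec (0x55), sha2-256 multihash (0x12, 32 bytes)."""
    return bytes([0x01, 0x55, 0x12, 0x20]) + hashlib.sha256(data).digest()

def write_car(root, blocks):
    """Header {"roots": [root], "version": 1}, then varint-prefixed CID+data sections."""
    link = b"\xd8\x2a\x58" + bytes([len(root) + 1]) + b"\x00" + root  # CBOR tag 42
    header = b"\xa2\x65roots\x81" + link + b"\x67version\x01"
    out = varint(len(header)) + header
    for cid, data in blocks:
        out += varint(len(cid) + len(data)) + cid + data
    return out

def read_car(car):
    """Return the (cid, data) pairs, skipping the header."""
    stream = io.BytesIO(car)
    stream.read(read_varint(stream))
    blocks = []
    while (size := read_varint(stream)) is not None:
        section = io.BytesIO(stream.read(size))
        for _ in range(3):  # CID version, codec, multihash code
            read_varint(section)
        section.seek(read_varint(section), io.SEEK_CUR)  # skip the digest
        cid_len, raw = section.tell(), section.getvalue()
        blocks.append((raw[:cid_len], raw[cid_len:]))
    return blocks

blocks = [(raw_cid(b"hello"), b"hello"), (raw_cid(b"world"), b"world")]
car = write_car(blocks[0][0], blocks)
```

Note that nothing here looks inside the block payloads, which is why the format works for any codec.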

One confusing thing is that dag-cbor or dag-json are codecs which can be used to represent any data. Is dag-pb a codec? I would say no, I would say this is a file format.

Back in the day I wanted to have a custom “codec” for a custom file format in cbor (I think) and I was told, “this is not how things work: use the cbor codec”.

If I’m not mistaken, ATProto is using cbor/json for encoding messages. Yet they don’t have their own format. When I see a dag-pb CID I know this is unixfs and thus know what to expect. I don’t have that luxury with the CID of an ATProto message, which just says “cbor”. There’s no embedded reference to the IPLD schema that I should be using either to understand things.

Imho it would be great to be able to distinguish file formats from their CIDs and the fact that something is json shouldn’t mean json-codec automatically, but regardless of that, the special status of dag-pb is inconsistent.
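To show how little a CID says about the format behind it, here is a sketch that reads the version and codec out of a base32 CIDv1 string. The codec table is a small excerpt of the multicodec table, the helper names are made up, and the example CIDs are generated in the code rather than taken from any real network.

```python
import base64
import hashlib

# Excerpt of the multicodec table; anything else is reported as a hex code.
CODECS = {0x55: "raw", 0x70: "dag-pb", 0x71: "dag-cbor", 0x0129: "dag-json"}

def varint(n):
    """Unsigned LEB128 varint."""
    out = bytearray()
    while True:
        byte, n = n & 0x7F, n >> 7
        out.append(byte | (0x80 if n else 0))
        if not n:
            return bytes(out)

def read_varint(buf, pos):
    """Decode a varint at buf[pos:], returning (value, next position)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

def make_cid(codec, data):
    """Base32 CIDv1 string ("b" multibase prefix) with a sha2-256 multihash."""
    raw = b"\x01" + varint(codec) + b"\x12\x20" + hashlib.sha256(data).digest()
    return "b" + base64.b32encode(raw).decode().lower().rstrip("=")

def describe(cid):
    """Return (version, codec name) for a base32 CIDv1 string."""
    if not cid.startswith("b"):
        raise ValueError("sketch only handles base32 CIDv1 strings")
    body = cid[1:].upper()
    raw = base64.b32decode(body + "=" * (-len(body) % 8))
    version, pos = read_varint(raw, 0)
    codec, pos = read_varint(raw, pos)
    return version, CODECS.get(codec, hex(codec))

message_cid = make_cid(0x71, b"some cbor-encoded message")
print(message_cid, describe(message_cid))
```

The codec field distinguishes dag-pb from dag-cbor, but it carries nothing that would tell two different dag-cbor-based formats apart.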

bumblefudge · October 22, 2024, 8:50am

The inner archivist in me demands that, for completeness, we include a link here to a drastic IPLD-v2 type idea based on BAOs proposed by @Gozala in the Storacha RFC process that would replace the “hash of raw bytes” CIDs in IPLD structures with IPLD-type-aware links, which are basically an alternate CID structure (and thus proposed in the single-byte range of the multicodec table). It doesn’t seem to have gotten much further than an idea at the time Irakli stepped back, and I think it’s safest to assume Irakli is busy and happy working on other problems now and should not be bothered with this stuff. But just in case anyone’s collecting weird ideas about relayering IPLD, it’s an interesting one! As Hannah pointed out in February, this would be an extremely breaking change, and probably only worth considering in a future where IPLD DAG usecases and “file”-based usecases have bifurcated a bit, and systems that handle one can safely ignore increasingly complex requirements of handling the other
