Uploading CARs and user generated CIDs

TL;DR: CARs are a great way to ensure end-to-end integrity for user uploads, but working with them is hard and they aren’t broadly supported.

Why do CAR uploads matter?

As part of the Dapps Working Group, we spend a lot of time thinking about and building tooling to make verified retrieval the norm.

However, verified retrieval of CIDs is only half of the full picture. The other side includes creating and onboarding content addressable data.

The lifecycle of data on IPFS starts with the data getting “merkleised” (for lack of a better term), i.e., chunked and transformed into a DAG with a root CID.
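To make the "chunk, hash, build a DAG, get a root" idea concrete, here is a minimal sketch. It is a toy binary hash tree over fixed-size chunks and is not the UnixFS/CID algorithm that IPFS tooling actually uses; the chunk size, the function names, and the hex digest standing in for a root CID are all assumptions for illustration.

```python
import hashlib

CHUNK_SIZE = 4  # tiny on purpose so the example produces several leaves


def chunk(data: bytes, size: int = CHUNK_SIZE) -> list[bytes]:
    """Split data into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)] or [b""]


def merkle_root(data: bytes, size: int = CHUNK_SIZE) -> str:
    """Hash every chunk, then hash groups of digests until one root remains."""
    level = [hashlib.sha256(c).digest() for c in chunk(data, size)]
    while len(level) > 1:
        # Group digests two at a time; an odd one out is hashed on its own.
        pairs = [level[i:i + 2] for i in range(0, len(level), 2)]
        level = [hashlib.sha256(b"".join(p)).digest() for p in pairs]
    return level[0].hex()


root_a = merkle_root(b"hello ipfs world")
root_b = merkle_root(b"hello ipfs w0rld")  # one byte changed
```

Changing a single byte changes the root, which is what lets a client that computed the root itself detect any tampering downstream.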

When working with user-authored data, e.g. in apps and dapps, there’s a strong motivation to generate the CID as close to the user as possible, i.e. in the browser, to ensure end-to-end integrity and reduce reliance on “trusted servers” to generate the CID.

The web3.storage team has championed this approach, which is not without its challenges, especially if you’re working with larger files that have to be merkleised on the user’s frontend. In fact, they’ve gone to great lengths to make this approach work even for large files by merkleising and uploading incrementally, in streaming style, so that you don’t exhaust memory.
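A rough sketch of the streaming idea, assuming a file-like source and a caller-supplied `upload` callback (both hypothetical, and unrelated to the actual web3.storage client): each chunk is hashed and handed off as soon as it is read, and only the small digests are kept in memory. The root here is just a hash over the leaf digests, a flat stand-in for a real DAG root.

```python
import hashlib
import io
from typing import BinaryIO, Callable


def stream_blocks(source: BinaryIO, upload: Callable[[str, bytes], None],
                  chunk_size: int = 1024) -> str:
    """Read, hash, and hand off one chunk at a time; keep only the digests."""
    digests = []
    while True:
        block = source.read(chunk_size)
        if not block:
            break
        digest = hashlib.sha256(block).digest()
        upload(digest.hex(), block)  # the block itself is not retained here
        digests.append(digest)
    # Hypothetical root: hash of the concatenated leaf digests.
    return hashlib.sha256(b"".join(digests)).hexdigest()


uploaded = []  # records (digest, block length) for each block handed off
root = stream_blocks(io.BytesIO(b"x" * 2500),
                     lambda cid, block: uploaded.append((cid, len(block))))
```

Memory use is bounded by the chunk size plus 32 bytes per chunk, regardless of how large the input is.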

This has been a net win for the ecosystem!

I don’t want to paint a falsely rosy picture, because we still have more progress to make, and I’m not 100% certain about the best way to move forward. So I’ll just share my experience.

I spent some time today testing CAR uploads from the browser in an app using pinning services. My motivation was to get a feel for what the story for “providing content addressed data from browsers” looks like.

The Pinning API spec was ruled out because it’s not broadly supported and it’s still hard to provide content from the browser (no returned delegates, and transport limitations make it hard to get blocks to the pinning service).

So I looked into CAR uploads. The challenge there is that only Infura, web3.storage, and Filebase support CAR uploads, each with a different API.

I ended up using web3.storage because it supports delegated uploads (where you generate a delegation that allows direct uploads by the user).

To get it working first, I went with a client-server approach where the user uploads to an endpoint controlled by me, which then uploads to web3.storage. Unfortunately, I hit an error from the web3.storage client running in a Cloudflare Worker.

I have to say that working with the web3.storage APIs is a challenge because of DIDs/UCANs. Just take a look at the code and the imports and data transformations needed just to upload a CAR.

You may argue that all of this is necessary to do delegated uploads (aka presigned upload URLs), but this is not a new pattern. Look how simple the example here is: Delegated upload tokens · api.video documentation. Just an HTTP request with a token string. No DIDs, no parsing proofs, no signers, no principals, no transforming base64 to Uint8Arrays.

I love what the folks at Web3.storage are doing, but believe we need to compress the cognitive overhead required to upload CARs.

Where do we go from here?

In the short term, I think sharing this feedback with the web3.storage team might lead to some better abstractions.

Side note

The spec for a data onboarding endpoint (with CAR support) is tangentially related.

I can imagine that adding a UCAN/DID layer on top of a generic HTTP API could be an interesting approach to explore.


Am I missing something?
What are your thoughts?

2 Likes

@danieln we’re currently working on a client refactor that will make using it significantly more enjoyable than it is right now. It’ll be out VERY soon!

We went through a number of iterations of the service before we finally settled where we are today and an unfortunate artifact of that is that the client is a little clunky right now.

The new client will likely only require a single import statement and parseProof will not be necessary. I think (hope) this basically covers your main gripes here…

The delegated upload tokens example you linked to looks like a centralized service that provides API tokens. Yes, that is much simpler, but no, it is not web3.

If you can open an issue on Issues · web3-storage/w3up · GitHub stating the error you experienced we can help you as well as others in the community.

1 Like

To add one more thing to what @alanshaw just said: DIDs and UCANs are what allow delegated uploads and allow the end user, not a server in the middle, to be in charge. So part of what you found appealing about web3.storage in comparison is the cause of the complexity.

That said, if that is not the right compromise, we are also building an HTTP bridge that anyone can run without dealing with any of the UCAN machinery. It shares the same tradeoffs as regular JWT / OAuth tokens, though: you’ll have to keep them secret. Here are links to an ongoing effort there

3 Likes

Ah this might be interesting for wovin team to integrate All-in-one Docker image with IPFS node best practices

cc @gotjoshua @tennox

2 Likes

The market doesn’t seem to care about CAR uploads.

We’ll have an end to end CAR go mirror soon-ish that has two features:

  • synch
  • faster than re-uploading, because it uses Bloom filters to negotiate which blocks need to be transferred over HTTPS
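As a rough illustration of the Bloom filter idea (not the actual CAR Mirror protocol or its wire format), the receiver can summarise the blocks it already holds in a small filter, and the sender skips anything the filter claims is present. The class, the parameters, and the placeholder CID strings below are all assumptions made for the sketch.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits: int = 1024, hashes: int = 3):
        self.size_bits = size_bits
        self.hashes = hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item: str):
        # Derive several bit positions per item by salting the hash input.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def blocks_to_send(local_cids, remote_filter):
    """Sender side: skip every block the receiver probably already has."""
    return [cid for cid in local_cids if cid not in remote_filter]


# Receiver summarises what it already stores (CID strings are placeholders).
remote = BloomFilter()
for cid in ["cid-a", "cid-b", "cid-c"]:
    remote.add(cid)

to_send = blocks_to_send(["cid-a", "cid-b", "cid-c", "cid-d", "cid-e"], remote)
```

Because a Bloom filter can report false positives, a sender may occasionally skip a block the receiver does not have, so a real protocol needs some way to notice and request the missing block afterwards.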

I’ll bug Philipp to drop some links here.

Note that this will be “centralized” but anyone can run Kubo with the plugin.

I don’t know that converging on one pattern here for uploads is the right thing. I’m most interested in partial synch across platforms — which is what dapps mostly need.

2 Likes

Will be the first to admit that I have limited experience with UCANs, but can you help me understand a bit more what’s going on here?

Why would you need to keep anything secret? Perhaps I’m misunderstanding but right now you have a GraphQL-like API where the request data is in the body like specs/w3-store.md at eb39253374483ee878395a4a3d0e16d07bc802ff · web3-storage/specs · GitHub

{
  can: "store/add",
  with: "did:key:abc...",
  nb: {
    link: "bag...",
    size: 1234
  }
}

being posted to myendpoint.tld when you could do something like:

curl -X POST "https://myendpoint.tld/store/add?link=bag...&size=1234" -H "Authorization: UCAN <ucan-auth-data>"

It seems you would end up with having something standardized such that if people wanted to go the JWT route or whatever they could and all the APIs would be roughly the same (e.g. a storage provider that didn’t want to use UCANs could conform to the same space as you and later if they decide they want to support delegation the API change is much smaller)

1 Like

FWIW I agree that this seems sensible. However, I think there’s room for both.

In particular, some storage providers (e.g. web3.storage) have built their infra to optimize around storage of CARs including to the point of handing users back proofs that will allow them to see their CARs being provably stored by Filecoin SPs. You could ask them (some are on this thread already) about it, but IIUC the play is basically to drive down costs by getting users to use clients that upload data in a way that is cost effective for the storage providers. This is somewhat independent of tools like UCANs and Filecoin proofs which might be able to exist independently of this particular approach (although likely with more pain).

However, yes, this upload CAR API makes mutating data and sending it to a storage provider a much more painful experience. For example, while I might be happy to store a copy of https://dist.ipfs.tech/ with web3.storage, the lack of support for the Pinning Service API (or really any sync tool) would put the onus on me to figure out what data was already stored by me with web3.storage, compute the delta locally, and then upload it. Maybe if the client did that for me I’d be happy enough, but having better sync APIs (including ways to describe the entities being synced beyond just a root CID) sounds great too. At the moment it’s a much smaller developer lift to store the various versions of dist.ipfs.tech with kubo/ipfs-cluster nodes or anyone else supporting the Pinning Service API (unless they have behaviors like downloading and/or charging you for the full DAG for each update).

My point was not to diss CAR.

As I said, we’ve implemented CAR go mirror over https transport because it solves the problem of being performant and being reliable on all clients, including having low battery usage on mobile.

My observation is that commercial operators tend to be pretty far away from protocol advancements.

Look at all the Qm CIDs out there!

And end users (devs) will use things that fit together and meet their needs, and those that unlock new use cases / features.

1 Like

I’m misunderstanding but right now you have a GraphQL-like API where the request data is in the body

So, the body IS the signed UCAN invocation. It includes the capability (can), resource (with) and parameters (nb). There’s no need to split these things out and put them in the URL - you’d lose verifiability of the requested invocation (unless you can repeat them in the UCAN).

You can (PUN INTENDED) put a UCAN in an Auth header, there’s a spec for that: GitHub - ucan-wg/ucan-http-bearer-token: UCAN-over-HTTP-Header Specification We’re using that spec in the HTTP bridge that @Gozala mentioned above.

In particular, some storage providers (e.g. web3.storage) have built their infra to optimize around storage of CARs

Just to clarify:

web3.storage built its infra around storage of content addressed data. CAR files are a reasonable way to send IPFS DAGs around today. If you send web3.storage a CAR file the blocks inside can be indexed and served over existing IPFS transports like bitswap and via IPFS gateways.

The system is designed so that you do not have to send CAR files. You tell web3.storage you want to store a CID (typically the hash of a CAR file) of a certain size (in bytes), and you get back a signed URL allowing only the specified number of bytes that hash to the given hash to be uploaded.
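The "declare the hash and size first, then upload exactly those bytes" flow can be sketched roughly as below. This is only an illustration of the check: in the service described here it is enforced through the signed URL, and the function names and dictionary shape are hypothetical.

```python
import hashlib


def declare(payload: bytes) -> dict:
    """Client side: describe the bytes before sending them."""
    return {"sha256": hashlib.sha256(payload).hexdigest(), "size": len(payload)}


def accept_upload(declaration: dict, received: bytes) -> bool:
    """Service side: accept only bytes that match the declared size and hash."""
    if len(received) != declaration["size"]:
        return False
    return hashlib.sha256(received).hexdigest() == declaration["sha256"]


car_bytes = b"pretend these are the bytes of a CAR file"
ticket = declare(car_bytes)

ok = accept_upload(ticket, car_bytes)               # matching bytes
tampered = accept_upload(ticket, car_bytes + b"!")  # wrong size and hash
```

Since the declaration is made before any bytes move, the uploader never has to trust the service to compute the identifier on its behalf.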

In the future, maybe web3.storage (or Filecoin!?) will do something smarter with non-CAR uploads. You can actually send non-CAR data to web3.storage today, but the service won’t do anything with it…yet, so I’d not recommend it.

including to the point of handing users back proofs that will allow them to see their CARs being provably stored by Filecoin SPs.

Hmm, not sure what you’re getting at here, the (PoDSI) proof has nothing to do with the type of the data. There is no mention of CAR files here: FIPs/FRCs/frc-0058.md at master · filecoin-project/FIPs · GitHub

…but yes web3.storage does expose proof of data aggregation to users, and through UCAN receipt chains they can verifiably trace their upload through the entire aggregation pipeline.

the lack of support for the Pinning Service API (or really any sync tool) would put the onus on me to figure out what data was already stored by me with web3.storage compute the delta locally and then upload it. Maybe if the client did that for me I’d be happy enough

We need smarter clients. I don’t think we should expect remote services to perform unbounded work to determine and retrieve a sub-DAG whose size is unknowable from a remote node(s?) that may not be reachable.

1 Like

Sure you can repeat the information or it can be implied from the path (i.e. before you compute the hash used for the signature you use the data from the path + query parameters). The reason to split these things out is, as mentioned, because the core of the API (e.g. asking for somewhere to dump a file) is unrelated to the permission system (UCAN) and so you’ve built an API that others cannot use without taking your permissioning system too which reduces compatibility and the ability for users to switch between providers.
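To illustrate the point about deriving the signed data from the path and query parameters, here is a minimal sketch. It uses an HMAC with a shared key purely as a stand-in: UCANs use public-key signatures and their own token format, and the canonicalisation scheme and function names here are assumptions, not part of any spec.

```python
import hashlib
import hmac
from urllib.parse import parse_qsl, urlsplit


def canonical(method: str, url: str) -> bytes:
    """Build a stable byte string from the method, path, and sorted query."""
    parts = urlsplit(url)
    query = "&".join(f"{k}={v}" for k, v in sorted(parse_qsl(parts.query)))
    return f"{method.upper()} {parts.path}?{query}".encode()


def sign(key: bytes, method: str, url: str) -> str:
    """Sign the canonical form, so the URL parameters are covered."""
    return hmac.new(key, canonical(method, url), hashlib.sha256).hexdigest()


def verify(key: bytes, method: str, url: str, signature: str) -> bool:
    return hmac.compare_digest(sign(key, method, url), signature)


key = b"demo-key"  # stand-in only; not how UCAN signing keys work
url = "https://myendpoint.tld/store/add?link=bag...&size=1234"
sig = sign(key, "POST", url)

same_request = verify(key, "POST", url, sig)
altered = verify(key, "POST", url.replace("size=1234", "size=9999"), sig)
```

With the parameters folded into what gets signed, changing `size` in the URL invalidates the signature, so the request stays verifiable even though the parameters live outside the body.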

Looking at that PR, it looks like the API for /bridge is special (i.e. doesn’t really match the non-bridge API, which is my point). You could’ve made it the same API for the bridge and not by computing the UCAN. This would also be the kind of thing others who are leveraging similar patterns could reuse. Similarly, for people who prefer REST-like vs GraphQL-like semantics (AFAICT most other services have gone with REST-like behavior).

Not with the encoding, but I mentioned that in relation to supporting “sync” behavior. If I asked you to only store the delta between old-dag(s) and new-dag the proof would be either less useful or much more annoying for the user to verify.

Who said anything about unbounded, or fetching data from unreachable nodes? I understand you’re coming from concerns about the current pinning service API (some of which I disagree with/don’t think are as problematic as you do), however that’s not the only option.

I put up a use case: a directory that evolves over time and I want old and new versions to both be available without being charged for the storage of a full clone of the data, or having to do a bunch of manual tracking work myself. A CAR upload API simply isn’t enough for this.
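The manual tracking work in that use case boils down to something like the sketch below: compare the blocks of the new version against what is already stored and upload only the difference. The CIDs are placeholder strings and the dictionaries stand in for whatever index a client or service would keep; none of this reflects an existing API.

```python
def delta(old_blocks: dict, new_blocks: dict) -> dict:
    """Return only the blocks of the new version that the old one lacks."""
    return {cid: data for cid, data in new_blocks.items() if cid not in old_blocks}


# Placeholder CIDs mapped to block bytes for two versions of a directory.
v1 = {"cid-readme": b"readme v1", "cid-bin": b"\x00" * 1000}
v2 = {"cid-readme2": b"readme v2", "cid-bin": b"\x00" * 1000, "cid-new": b"new file"}

to_upload = delta(v1, v2)

# Bytes in v2 that are shared with v1 and so need no second upload or copy.
bytes_saved = sum(len(data) for cid, data in v2.items() if cid in v1)
```

The catch is that the uploader needs to know which blocks the provider already holds, which is exactly the bookkeeping a sync-oriented API would take off the user's hands.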

@boris similarly expressed interest in sync rather than just upload. He’s probably thinking more generally but consider that WNFS supports versioning which would be a big pain with a CAR-upload-only API just as in my example.

1 Like

Back on topic: @danieln shared some feedback and wants to be able to confidently point developers at an approach to uploading to IPFS.

And: that multiple commercial providers had different approaches.

I think that product teams are going to pick what works for them for a variety of reasons. I’d love to pay someone to take my synch and persistence business, but so far teams haven’t found that a compelling market segment.

(Maybe back off topic from here :laughing:)

As newly independent entities, what areas do we work together on? Not sure.

I’d love to maybe do something like look at what our engineering dependency trees look like from a library / module perspective.

I am interested in what it looks like for clusters of hosts to coordinate around IPFS Mainnet.

This is all emergent stuff.

2 Likes

No. web3.storage originally required data to be uploaded as a CAR to ensure that the users created their own dags and derived their own CID for the data, that we then promise to store for them. Otherwise we’d be just another centralised upload service that you’d have to trust.

Now you can upload any bytes to us as long as you also tell us the CID for those bytes in advance. The tools we provide encode your files as CARs, which also works nicely with being able to aggregate them into Filecoin deals in a way they can verify it.

2 Likes

We want to provide better support for this. It sounds like @boris will soon too. We should talk!

From the w3s side it has been hard for us to prioritise this use case. I ran the numbers while we were still using ipfs-cluster+kubo as the main store and for ~1PiB only about 0.1% of it was de-duplicated data. I don’t doubt the use case, and the opportunities to speed up uploads with diffs, but our users were almost entirely uploading novel content.

1 Like