Uploading CARs and user generated CIDs

danieln · February 21, 2024, 1:22pm

TL;DR: CARs are a great way to ensure end-to-end integrity for user uploads, but working with them is hard and they aren’t broadly supported.

Why do CAR uploads matter?

As part of the Dapps Working Group, we spend a lot of time thinking about, and providing tooling for, making verified retrieval the norm.

However, verified retrieval of CIDs is only half of the full picture. The other side includes creating and onboarding content addressable data.

The lifecycle of data on IPFS starts with the data getting “merkelised” (for lack of a better term), i.e., chunked and transformed into a DAG with a root CID.
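To make “merkelised” concrete, here is a minimal Python sketch. It assumes fixed-size chunks, raw leaf blocks, and a toy root block that just lists the leaf CIDs. Real importers (Kubo, Helia and friends) build dag-pb/UnixFS nodes with balanced layouts, so the root CID computed here will not match what those tools produce for multi-chunk files. The helper names (`raw_cid`, `chunk`, `merkelise`) are made up for illustration.

```python
import base64
import hashlib

CID_VERSION = 0x01  # CIDv1
RAW_CODEC = 0x55    # multicodec code for "raw"
SHA2_256 = 0x12     # multihash code for sha2-256


def raw_cid(block: bytes) -> str:
    """Return the CIDv1 (raw codec, sha2-256) of one block as a base32 string."""
    digest = hashlib.sha256(block).digest()
    multihash = bytes([SHA2_256, len(digest)]) + digest
    cid_bytes = bytes([CID_VERSION, RAW_CODEC]) + multihash
    # Multibase prefix "b" means lowercase base32 without padding.
    encoded = base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")
    return "b" + encoded


def chunk(data: bytes, size: int = 256 * 1024) -> list:
    """Split data into fixed-size chunks (a single empty chunk for empty input)."""
    return [data[i:i + size] for i in range(0, len(data), size)] or [b""]


def merkelise(data: bytes, size: int = 256 * 1024) -> tuple:
    """Return (root_cid, leaf_cids) for the given bytes."""
    leaf_cids = [raw_cid(block) for block in chunk(data, size)]
    if len(leaf_cids) == 1:
        # A single chunk is its own root.
        return leaf_cids[0], leaf_cids
    # Toy root block: newline-joined leaf CIDs. This is NOT dag-pb/UnixFS.
    root_block = "\n".join(leaf_cids).encode("ascii")
    return raw_cid(root_block), leaf_cids


root, leaves = merkelise(b"hello world" * 100_000)
print(root, len(leaves))
```

The point of doing this in the browser is that the root CID is derived from bytes the user holds, so any server that later stores or serves the data can be checked against it.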

When working with user-authored data, e.g. in apps and dapps, there’s a strong motivation to have the CID generated as close as possible to the user, i.e. in the browser, to ensure end-to-end integrity and reduce reliance on “trusted servers” to generate the CID.

The web3.storage team has championed this approach, which is not without its challenges, especially if you’re working with larger files that have to be merkelised on the user frontend. In fact, they’ve gone to lengths to make this approach work even for large files by merkelising and uploading incrementally in streaming style, so that you don’t exhaust memory.

This has been a net win for the ecosystem!

I don’t want to just falsely paint a rosy picture, because we still have more progress to make, and I’m not 100% certain on the best way to move forward. So I’ll just share my experience.

I spent some time today testing CAR uploads from the browser in an app using pinning services. My motivation was to get a feel for what the story for “providing content addressed data from browsers” looks like.

The Pinning API spec was ruled out because it’s not broadly supported and it’s still hard to provide content from the browser (no delegates are returned, and browser transport limitations make it hard to get blocks to the pinning service).

So I looked into CAR uploads, and the challenge there is that only Infura, web3.storage, and Filebase support CAR uploads, each with a different API.

I ended up using web3.storage because it supports delegated uploads (where you generate a delegation that allows direct uploads by the user).

To get it working first, I went with a client-server approach where the user uploads to an endpoint controlled by me, which then uploads to web3.storage. Unfortunately, I hit an error from the web3.storage client running in a Cloudflare Worker.

I have to say that working with the web3.storage APIs is a challenge because of DIDs/UCANs. Just take a look at the code, and at the imports and data transformations needed just to upload a CAR.

You may argue that all of this is necessary to do delegated uploads (aka presigned upload URLs), but this is not a new pattern…. Look how simple the example here is: Delegated upload tokens · api.video documentation. Just an HTTP request with a token string. No DIDs, no parsing proofs, no signers, no principals, no transforming base64 to uint8arrays.

I love what the folks at Web3.storage are doing, but believe we need to compress the cognitive overhead required to upload CARs.

Where do we go from here?

In the short term, I think sharing this feedback with the web3.storage team might lead to some better abstractions.

Side note

The spec for a data onboarding endpoint (with CAR support) is tangentially related.

github.com/ipfs/specs

IPIP: Data Onboarding via HTTP POST (and future ipfs:// POST|PUT)

opened 09:34PM - 11 Mar 22 UTC

lidel

P2 need/analysis IPIP

## Problem statement

HTTP Gateways are the most successful way for retrieving content-addressed data. Successful use of HTTP for retrieval use cases proves that IPFS does not replace HTTP, but augments it by providing variability and resiliency. IPFS over HTTP brings more value than the sum of its parts.

Removing the need for implementation specific RPC APIs (like one in Kubo) allowed not only faster adoption of CIDs on the web, but enabled alternative implementations of IPFS (like Iroh in Rust) to test compliance and benchmark themselves against each other.

While we have HTTP Gateways as a standard HTTP-based answer to the retrieval of data stored with IPFS (including verifiable [application/vnd.ipld.raw](https://www.iana.org/assignments/media-types/application/vnd.ipld.raw) and [application/vnd.ipld.car](https://www.iana.org/assignments/media-types/application/vnd.ipld.car) responses), the data onboarding over HTTP is currently done with vendor-specific APIs.

The status quo at 2023 Q1 is pretty bad from the end user/developer’s perspective: every IPFS implementation, including online services providing storage and pinning services, exposes custom opinionated HTTP API for onboarding data to IPFS.

## Why we need IPIP for HTTP Data Onboarding

To illustrate, some prominent examples (2022 Q4):

<details>
<summary>Click to expand :see_no_evil: </summary>

- Implementations
  - Kubo RPC (AKA legacy /api/v0/..)
    - Is often used as a “standard HTTP API upload template” because it has commands for all onboarding needs:
      - [https://docs.ipfs.tech/reference/kubo/rpc/#api-v0-add](https://web.archive.org/web/20221201011916/https://docs.ipfs.tech/reference/kubo/rpc/#api-v0-add) – files and directories
        - FLAG: it uses custom form-data handling that requires special library for directory upload, which is an awful papercut for someone expecting simple upload with “curl” ([http://web.archive.org/web/20221201011916/https://docs.ipfs.tech/reference/kubo/rpc/#request-body](http://web.archive.org/web/20221201011916/https://docs.ipfs.tech/reference/kubo/rpc/#request-body))
        - FLAG: Kubo RPC was never designed to be used in browser context, and there are known bugs around the way it handles uploads (example: [https://github.com/ipfs/kubo/issues/5168](https://github.com/ipfs/kubo/issues/5168))
      - [https://docs.ipfs.tech/reference/kubo/rpc/#api-v0-block-put](https://docs.ipfs.tech/reference/kubo/rpc/#api-v0-block-put) – raw block
      - [https://docs.ipfs.tech/reference/kubo/rpc/#api-v0-dag-put](https://docs.ipfs.tech/reference/kubo/rpc/#api-v0-dag-put) – JSON-like documents and custom DAGs (DAG-JSON and DAG-CBOR)
      - [https://docs.ipfs.tech/reference/kubo/rpc/#api-v0-dag-import](https://docs.ipfs.tech/reference/kubo/rpc/#api-v0-dag-import) – arbitrary bags of blocks in CAR format
  - JS-IPFS
    - Reimplements most of the Kubo RPC and exposes it over HTTP, but diverged long time ago and is not 1:1
    - FLAG: In addition to HTTP, JS-IPFS exposes selected commands over gRPC-over-WebSockets, to work-around browser issues caused by Kubo RPC ([https://web.archive.org/web/20220528152743/https://github.com/ipfs/js-ipfs/tree/master/packages/ipfs-grpc-server#why](https://web.archive.org/web/20220528152743/https://github.com/ipfs/js-ipfs/tree/master/packages/ipfs-grpc-server#why))
  - IPFS Cluster
    - Acts as a reverse proxy for Kubo RPC, but has own commands too and provides special behavior on top of what Kubo RPC does:
      - [https://web.archive.org/web/20220911053755/https://ipfscluster.io/documentation/reference/api/](http://web.archive.org/web/20220911053755/https://ipfscluster.io/documentation/reference/api/) – `/add` endpoint uses unixfs by default, but also accepts CARs when HTTP POST request is made with `?format=car` and it only accepts CARs with single root.
- Online services
  - Pinata
    - [https://web.archive.org/web/20220930091452/https://docs.pinata.cloud/pinata-api/pinning/pin-file-or-directory](https://web.archive.org/web/20220930091452/https://docs.pinata.cloud/pinata-api/pinning/pin-file-or-directory) – onboarding file or directory
    - [https://web.archive.org/web/20220817122725/https://docs.pinata.cloud/pinata-api/pinning/pin-json](https://web.archive.org/web/20220817122725/https://docs.pinata.cloud/pinata-api/pinning/pin-json) – onboarding JSON document
  - web3storage
    - [http://web.archive.org/web/20220914153854/https://web3.storage/docs/reference/http-api/](http://web.archive.org/web/20220914153854/https://web3.storage/docs/reference/http-api/) – file and CAR uploads
    - note: no block API (impossible to import DAG-CBOR without the overhead of single-block-CAR for every CID)
  - Infura
    - [http://web.archive.org/web/20220429202905/https://docs.infura.io/infura/networks/ipfs/http-api-methods/add](http://web.archive.org/web/20220429202905/https://docs.infura.io/infura/networks/ipfs/http-api-methods/add) – file and directory import API that is carbon-copy of Kubo’s internal RPC API
    - [http://web.archive.org/web/20220429203039/https://docs.infura.io/infura/networks/ipfs/http-api-methods/block_put](http://web.archive.org/web/20220429203039/https://docs.infura.io/infura/networks/ipfs/http-api-methods/block_put) – raw block import that is carbon-copy of Kubo’s internal RPC API
    - note: no CAR import
  - TODO: source more examples

</details>

And the CAR upload API insanity circa 2024 Q1:

- https://discuss.ipfs.tech/t/uploading-cars-and-user-generated-cids/17592

This state of things introduces an artificial barrier to adoption: the user needs to learn what APIs are available, and then “pick winners” – decide which implementations and services are the most future-proof. And even then, many choices are burdened by legacy of Kubo RPC and its degraded performance and DX/UX in web browsers.

## Goal: create data onboarding protocol for both HTTP and native IPFS

The intention here is to create IPIP with a vendor-agnostic protocol for onboarding data that:

- is easy to use and implement in HTTP (`POST https://`)
  - does not require any libraries or documentation,
  - and is as easy to work with from JS with `fetch` API as it is in the command-line with `curl`
- follow the retrieval story, where `ipfs://` behavior is analogous to subdomain gateways
  - :point_right: what we want, is to have a protocol that can be represented as both `POST https://` AND `POST ipfs://` APIs

## IPIP scope

We want two IPIPs: one for onboarding data with HTTP POST, and one for authoring (modifying/pathing) it with HTTP PUT. This allows us to ship most useful onboarding first, and then do authoring as an optional add-on, which services may support, but don’t have to (if they are only onboarding to filecoin etc).

For now, focusing on the POST

### POST Requests (Onboarding)

> 👉 This is the minimal scope we need to cover from the day one, ensuring every use case has a vendor-agnostic spec.

- **Delegated**
  - Single File (UnixFS) or single (DAG-)CBOR/JSON document
  - Arbitrary Directory tree (UnixFS)
    - Option A: TAR stream
      - open question: how does this handle interrupted upload? can server tell some data is missing?
    - Option B: custom form-data? (think twice, we have lessons learned around RPC at `/api/v0/add` in Kubo)
- **Native**
  - Raw block
  - CAR stream

The working code for this will be reference implementation that replaces/updates the legacy [`Gateway.Writable` feature in Kubo](https://github.com/ipfs/kubo/blob/master/docs/config.md#gatewaywritable) with the above feature set.

### PUT/PATCH/DELETE Requests (Authoring)

This will be a separate IPIP, but flagging this as long term plans that should feel idiomatic too.

- TBD: **Delegated** vs **Native**
  - Critical: ensure no surprises, UX/DX is paramount. Needs research and analysis.
  - One idea is to keep it limited to patching UnixFS paths and DAG-JSON/CBOR documents.
  - Other idea is to have syntax parity with JSON-based [IPLD Path](https://ipld.io/specs/patch/) and have the same JSON syntax as [`dag diff`](https://github.com/ipfs/kubo/issues/4801) and [`dag patch`](https://github.com/ipfs/kubo/issues/4782) commands.

## References

- Revisit the [concept of Writable Gateways](https://discuss.ipfs.io/t/writeable-http-gateways/210?u=lidel)
  - https://github.com/ipfs/go-ipfs/blob/master/docs/config.md#gatewaywritable
  - https://discuss.ipfs.io/t/writeable-http-gateways/210
- https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Location#pointing_to_a_new_document_http_201_created
- WIP private IPIP draft: https://www.notion.so/protocollabs/wip-IPIP-Data-Onboarding-with-HTTP-POST-4c394b8ebb774f2d87d34466019257fc
- Alex prototyped some REST APIs in https://github.com/ipfs/specs/pull/224/files (while this was intending to be update to Kubo RPC, the document includes some ideas around patching files and directories)
- https://docs.api.video/vod/delegated-upload-tokens as prior art where [opaque token can be used with standard tools like curl](https://docs.api.video/vod/delegated-upload-tokens#upload-a-video-with-delegated-tokens)

I can imagine that adding a UCAN/DID layer on top of a generic HTTP API could be an interesting approach to explore.

Am I missing something?
What are your thoughts?

alanshaw · February 21, 2024, 4:28pm

@danieln we’re currently working on a client refactor that will make using it significantly more enjoyable than it is right now. It’ll be out VERY soon!

We went through a number of iterations of the service before we finally settled where we are today and an unfortunate artifact of that is that the client is a little clunky right now.

The new client will likely only require a single import statement and parseProof will not be necessary. I think (hope) this basically covers your main gripes here…

The delegated upload tokens example you linked to looks like a centralized service that provides API tokens. Yes, that is much simpler, but no, it is not web3.

If you can open an issue on Issues · web3-storage/w3up · GitHub stating the error you experienced we can help you as well as others in the community.

Gozala · February 21, 2024, 5:06pm

To add one more thing to what @alanshaw just said: DIDs and UCANs are what allow delegated uploads and let the end user, not a server in the middle, be in charge. So part of what you found appealing about web3.storage in comparison is also the cause of the complexity.

That said, if that is not the right compromise, we are also building an HTTP bridge that anyone can run without dealing with any of the UCAN machinery. It shares the same tradeoffs as regular JWT / OAuth tokens, though: you’ll have to keep them secret. Here are links to an ongoing effort there:

github.com/web3-storage/w3infra

wip: first draft of UCAN bridge

web3-storage:main ← web3-storage:feat/ucan-bridge

opened 12:18AM - 07 Feb 24 UTC

travis

+320 -1

To support users in languages that do not have existing UCAN invocation implementations, we are going to launch a bridge that allows them to make simple HTTP requests with JSON bodies that we transform into proper UCAN invocations.

So far this PR has:

1) an untested (but type-checking!) implementation of such a bridge
2) a markdown description of the bridge protocol, intended to be the first draft of an eventual specification (please review before the code!)

Notable design choices:

1) I chose to include JUST the base64pad-encoded "secret" in the Authorization header to avoid running afoul of maximum header size restrictions that exist in [some HTTP environments](https://stackoverflow.com/questions/686217/maximum-on-http-header-values).
2) The `proof` field of the JSON body is a base64pad encoded "delegation archive" (created with `ucanto`'s `Delegation.archive` function)

Values for both of these fields can be generated using the `bridge generate-tokens` w3cli command proposed here: https://github.com/web3-storage/w3cli/pull/175

TODO

- [ ] factor core bridge logic out to a separate library
- [ ] factor HTTP input wrangling out to a separate function
- [ ] rename `UPLOAD_API_DID` and `ACCESS_SERVICE_URL` environment variables to `W3UP_SERVICE_DID` and `W3UP_SERVICE_URL`
- [ ] add tests
- [ ] expand and formalize bridge specification, move it to the specs repo (?)
- [ ] document response format

boris · February 21, 2024, 7:22pm

Ah this might be interesting for wovin team to integrate All-in-one Docker image with IPFS node best practices

cc @gotjoshua @tennox

boris · February 22, 2024, 1:27am

The market doesn’t seem to care about CAR uploads.

We’ll have an end to end CAR go mirror soon-ish that has two features:

- synch
- faster than re-uploading, because it uses bloom filters to negotiate what blocks need to get transferred over HTTPS
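For readers unfamiliar with the idea, here is a rough Python sketch of bloom-filter negotiation. It is not the actual CAR Mirror wire protocol: the filter size, hash count, class and function names are all illustrative assumptions. The receiver summarises the CIDs it already holds in a small filter, and the sender skips every block the filter claims to contain.

```python
import hashlib


class BloomFilter:
    """Tiny bloom filter; size and hash count are illustrative, not tuned."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item: str):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def blocks_to_send(local_cids: list, remote_filter: BloomFilter) -> list:
    """CIDs the remote side definitely lacks.

    Bloom filters have no false negatives, so nothing the remote already
    holds is re-sent. False positives are possible, so a block the remote
    lacks may be skipped and has to be requested in a later round.
    """
    return [cid for cid in local_cids if cid not in remote_filter]


# Placeholder CIDs: the receiver already holds two of the three blocks.
remote = BloomFilter()
for cid in ["cid-a", "cid-b"]:
    remote.add(cid)

print(blocks_to_send(["cid-a", "cid-b", "cid-c"], remote))
```

The filter is a few hundred bytes regardless of how many blocks it summarises, which is why it is cheaper than exchanging full CID lists.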

I’ll bug Philipp to drop some links here.

Note that this will be “centralized” but anyone can run Kubo with the plugin.

I don’t know that converging on one pattern here for uploads is the right thing. I’m most interested in partial synch across platforms — which is what dapps mostly need.

adin · February 22, 2024, 1:54am

Will be the first to admit that I have limited experience with UCANs, but can you help me understand a bit more what’s going on here?

Why would you need to keep anything secret? Perhaps I’m misunderstanding but right now you have a GraphQL-like API where the request data is in the body like specs/w3-store.md at eb39253374483ee878395a4a3d0e16d07bc802ff · web3-storage/specs · GitHub

{
  can: "store/add",
  with: "did:key:abc...",
  nb: {
    link: "bag...",
    size: 1234
  }
}

being posted to myendpoint.tld when you could do something like:

curl -X POST "https://myendpoint.tld/store/add?link=bag...&size=1234" -H "Authorization: UCAN <ucan-auth-data>"

It seems you would end up with having something standardized such that if people wanted to go the JWT route or whatever they could and all the APIs would be roughly the same (e.g. a storage provider that didn’t want to use UCANs could conform to the same space as you and later if they decide they want to support delegation the API change is much smaller)
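A minimal Python sketch of that mapping, assuming the capability name becomes the path and the `nb` parameters become the query string. The helper name `invocation_to_request` is hypothetical, and the sketch assumes the `with` resource travels inside the authorization data rather than the URL. The request is only constructed here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def invocation_to_request(base_url: str, invocation: dict, auth_header: str) -> Request:
    """Build a REST-style POST from a UCAN-style invocation body."""
    # The capability ("store/add") becomes the path; `nb` becomes the query string.
    query = urlencode(invocation["nb"])
    url = f"{base_url.rstrip('/')}/{invocation['can']}?{query}"
    # Authorization stays in a header, so a provider could swap UCAN for a JWT
    # or an opaque token without changing the path or parameters.
    return Request(url, method="POST", headers={"Authorization": auth_header})


invocation = {
    "can": "store/add",
    "with": "did:key:abc...",
    "nb": {"link": "bag...", "size": 1234},
}
req = invocation_to_request(
    "https://myendpoint.tld", invocation, "UCAN <ucan-auth-data>"
)
print(req.get_method(), req.full_url)
```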

adin · February 22, 2024, 2:42am

FWIW I agree that this seems sensible. However, I think there’s room for both.

In particular, some storage providers (e.g. web3.storage) have built their infra to optimize around storage of CARs including to the point of handing users back proofs that will allow them to see their CARs being provably stored by Filecoin SPs. You could ask them (some are on this thread already) about it, but IIUC the play is basically to drive down costs by getting users to use clients that upload data in a way that is cost effective for the storage providers. This is somewhat independent of tools like UCANs and Filecoin proofs which might be able to exist independently of this particular approach (although likely with more pain).

However, yes this upload CAR API makes mutating data and sending it to a storage provider a much more painful experience. For example, while I might be happy to store a copy of https://dist.ipfs.tech/ with web3.storage, the lack of support for the Pinning Service API (or really any sync tool) would put the onus on me to figure out what data was already stored by me with web3.storage, compute the delta locally, and then upload it. Maybe if the client did that for me I’d be happy enough, but having better sync APIs (including ways to describe the entities being synced beyond just a root CID) sounds great too. At the moment it’s a much smaller developer lift to store the various versions of dist.ipfs.tech with kubo/ipfs-cluster nodes or anyone else supporting the Pinning Service API (unless they have behaviors like downloading and/or charging you for the full DAG for each update).

boris · February 22, 2024, 5:43am

My point was not to diss CAR.

As I said, we’ve implemented CAR go mirror over https transport because it solves the problem of being performant and being reliable on all clients, including having low battery usage on mobile.

My observation is that commercial operators tend to be pretty far away from protocol advancements.

Look at all the Qm CIDs out there!

And end users (devs) will use things that fit together and meet their needs, and those that unlock new use cases / features.

alanshaw · February 22, 2024, 10:43am

> I’m misunderstanding but right now you have a GraphQL-like API where the request data is in the body

So, the body IS the signed UCAN invocation. It includes the capability (can), resource (with) and parameters (nb). There’s no need to split these things out and put them in the URL - you’d lose verifiability of the requested invocation (unless you can repeat them in the UCAN).

You can (PUN INTENDED) put a UCAN in an Auth header, there’s a spec for that: GitHub - ucan-wg/ucan-http-bearer-token: UCAN-over-HTTP-Header Specification We’re using that spec in the HTTP bridge that @Gozala mentioned above.

> In particular, some storage providers (e.g. web3.storage) have built their infra to optimize around storage of CARs

Just to clarify:

web3.storage built its infra around storage of content addressed data. CAR files are a reasonable way to send IPFS DAGs around today. If you send web3.storage a CAR file the blocks inside can be indexed and served over existing IPFS transports like bitswap and via IPFS gateways.

The system is designed so that you do not have to send CAR files. You tell web3.storage you want to store a CID (typically the hash of a CAR file) of a certain size (in bytes), and you get back a signed URL allowing only the specified number of bytes that hash to the given hash to be uploaded.
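The guarantee described here can be pictured as the following check. This is only a sketch, assuming SHA-256 and a hex-encoded digest. In the real system the constraint is enforced by the signed URL, not by application code like this, and the function name is made up.

```python
import hashlib
import hmac


def verify_upload(body: bytes, expected_sha256_hex: str, expected_size: int) -> bool:
    """Accept the upload only if both the declared size and the hash match."""
    if len(body) != expected_size:
        return False
    actual = hashlib.sha256(body).hexdigest()
    # Constant-time comparison of the two hex digests.
    return hmac.compare_digest(actual, expected_sha256_hex)


# The client declares hash and size up front, then uploads the bytes.
payload = b"pretend these are CAR bytes"
declared_hash = hashlib.sha256(payload).hexdigest()
declared_size = len(payload)

print(verify_upload(payload, declared_hash, declared_size))
print(verify_upload(payload + b"!", declared_hash, declared_size))
```

Because the hash is committed before any bytes are sent, the storage service cannot substitute different content without the mismatch being detectable.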

In the future, maybe web3.storage (or Filecoin!?) will do something smarter with non-CAR uploads. You can actually send non-CAR data to web3.storage today, but the service won’t do anything with it…yet, so I’d not recommend it.

> including to the point of handing users back proofs that will allow them to see their CARs being provably stored by Filecoin SPs.

Hmm, not sure what you’re getting at here, the (PoDSI) proof has nothing to do with the type of the data. There is no mention of CAR files here: FIPs/FRCs/frc-0058.md at master · filecoin-project/FIPs · GitHub

…but yes web3.storage does expose proof of data aggregation to users, and through UCAN receipt chains they can verifiably trace their upload through the entire aggregation pipeline.

> the lack of support for the Pinning Service API (or really any sync tool) would put the onus on me to figure out what data was already stored by me with web3.storage compute the delta locally and then upload it. Maybe if the client did that for me I’d be happy enough

We need smarter clients. I don’t think we should expect remote services to perform unbounded work to determine and retrieve a sub-DAG whose size is unknowable from a remote node(s?) that may not be reachable.

adin · February 22, 2024, 3:51pm

Sure you can repeat the information or it can be implied from the path (i.e. before you compute the hash used for the signature you use the data from the path + query parameters). The reason to split these things out is, as mentioned, because the core of the API (e.g. asking for somewhere to dump a file) is unrelated to the permission system (UCAN) and so you’ve built an API that others cannot use without taking your permissioning system too which reduces compatibility and the ability for users to switch between providers.

Looking at that PR, it looks like the API for /bridge is special (i.e. doesn’t really match the non-bridge API, which is my point). You could’ve made it the same API for the bridge and not by computing the UCAN. This would also be the kind of things others who are leveraging similar patterns could reuse. Similarly, for people who prefer REST-like vs GraphQL-like semantics (AFAICT most other services have gone with REST-like behavior).

Not with the encoding, but I mentioned that in relation to supporting “sync” behavior. If I asked you to only store the delta between old-dag(s) and new-dag the proof would be either less useful or much more annoying for the user to verify.

Who said anything about unbounded, or fetching data from unreachable nodes? I understand you’re coming from concerns about the current pinning service API (some of which I disagree with/don’t think are as problematic as you do), however that’s not the only option.

I put up a use case: a directory that evolves over time and I want old and new versions to both be available without being charged for the storage of a full clone of the data, or having to do a bunch of manual tracking work myself. A CAR upload API simply isn’t enough for this.
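As a rough illustration of the bookkeeping this use case needs, here is a Python sketch that computes which blocks a new version adds over an old one. The DAG is modelled as a plain mapping from a CID to its child CIDs, and all the CIDs are placeholder strings. A real client would walk dag-pb/UnixFS links instead, and the function names are hypothetical.

```python
def reachable(root: str, links: dict) -> set:
    """All CIDs reachable from root, following child links."""
    seen = set()
    stack = [root]
    while stack:
        cid = stack.pop()
        if cid in seen:
            continue
        seen.add(cid)
        stack.extend(links.get(cid, []))
    return seen


def dag_delta(old_root: str, new_root: str, links: dict) -> set:
    """Blocks present in the new version but not in the old one."""
    return reachable(new_root, links) - reachable(old_root, links)


# Two versions of a directory that share an unchanged subdirectory.
links = {
    "root-v1": ["dir-a", "file-x"],
    "root-v2": ["dir-a", "file-y"],
    "dir-a": ["file-1", "file-2"],
}

print(sorted(dag_delta("root-v1", "root-v2", links)))
```

Only the new root and the changed file need uploading, and the shared subdirectory is stored once. With a CAR-upload-only API, this tracking falls on the uploader.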

@boris similarly expressed interest in sync rather than just upload. He’s probably thinking more generally but consider that WNFS supports versioning which would be a big pain with a CAR-upload-only API just as in my example.

boris · February 22, 2024, 4:25pm

Back on topic: @danieln shared some feedback and wants to be able to confidently point developers at an approach to uploading to IPFS.

And: that multiple commercial providers had different approaches.

I think that product teams are going to pick what works for them for a variety of reasons. I’d love to pay someone to take my synch and persistence business, but so far teams haven’t found that a compelling market segment.

(Maybe back off topic from here)

As newly independent entities, what areas do we work together on? Not sure.

I’d love to maybe do something like look at what our engineering dependency trees look like from a library / module perspective.

I am interested in what it looks like for clusters of hosts to coordinate around IPFS Mainnet.

This is all emergent stuff.

olizilla · February 22, 2024, 5:53pm

No. web3.storage originally required data to be uploaded as a CAR to ensure that the users created their own dags and derived their own CID for the data, that we then promise to store for them. Otherwise we’d be just another centralised upload service that you’d have to trust.

Now you can upload any bytes to us as long as you also tell us the CID for those bytes in advance. The tools we provide encode your files as CARs, which also works nicely with being able to aggregate them into Filecoin deals in a way they can verify it.

olizilla · February 22, 2024, 6:06pm

We want to provide better support for this. It sounds like @boris will soon too. We should talk!

From the w3s side it has been hard for us to prioritise this use case. I ran the numbers while we were still using ipfs-cluster+kubo as the main store and for ~1PiB only about 0.1% of it was de-duplicated data. I don’t doubt the use case, and the opportunities to speed up uploads with diffs, but our users were almost entirely uploading novel content.
