Work-plans for kubo, helia, & other Shipyard IPFS projects in 2025

cewood · December 6, 2024, 11:57am

Hi everyone

We’re excited to share a quick update about our 2025 work-plans for IPFS at Shipyard! Your feedback and ideas mean a lot to us, and we’d love to hear your thoughts to make these plans even better.

Feel free to drop your comments or suggestions here—we’re all ears!

Let’s make 2025 a breakthrough year for IPFS, together.

Cheers,
Cameron.

blurpesec · December 6, 2024, 9:30pm

Are these lightweight clients intended to be light enough to run fully P2P within something like a chrome extension?

mosh · December 6, 2024, 10:04pm

Excited for the push to decrease dependencies on gateways for the public network!

Can we hear more about support for non-UnixFS data? I know there have been repeated calls for better support of large blocks, do these overlap at all?

I’m really interested in how kubo, helia, and Shipyard could contribute to more IPFS adoption outside of web3, like implementing some of the streamlined tools and tests discussed in the recent CID Congress meeting.

willscott · December 6, 2024, 11:33pm

I worry that kubo continues to be pulled in a lot of directions. Features for large pinners, like ‘provide to IPNI’ are different priorities than what’s wanted for desktop hobbyists. Identifying work within those distinct audiences might be helpful
I’m excited about potential multiplier of leveraging multiple levels of HTTP peers. I would value demos of webseed and browser fetching much more than python http-based libraries
I agree with Mosh that there seems to be a gap between shipyard trajectory and what what the foundation is thinking about. I would expand the ‘evolution of protocol specifications’ to try to capture that shipyard is a thought leader that should be driving the conversation around things like dazl.
Are there opportunities to shape any of the browser/http half of the work in products / services. It seems are could be complementary things to build as paid offerings, which would demonstrate the value of the tech
- Is there a CDN / ipfs-ification of a web publisher’s assets that could be offered as a service? (what’s missing for that)

markg85 · December 7, 2024, 7:18pm

I have an IPFS node with a couple TB of data running on a AMD threadripper (sure, first gen but it’s still a monster of a machine). That pc is too slow for providing. Accelerated DHT is not an option (link for reference, i won’t go into detail here).

This very same issue exists for anyone who would want to host a larger set of data with IPFS. Think of archlinux packages, npm for nodejs and so many others. This puts a higher bar of entry on IPFS where it should not be needed.

While i’m happy to see (re)providing is an issue that will be looked at in 2025, i don’t see the actual issue being looked at. Instead i see options being added that would help users that are not the average ipfs early adopters.

On a related note, IPFS (read kubo) is still a very huge CPU and memory hog. Even if you have no data added at all. But if you have a large dataset (or even just a lot of small files) then it’s just going to eat up your cpu and memory no matter the high end server you run kubo on. This has been known for years! It would be great if some effort would be put into making kubo behave itself. I’m guessing the providing mechanism is having a large effect here too.

ylempereur · December 8, 2024, 2:46pm

Hmmm, I guess it was kinda subtle, but you clearly missed it. Here is a quote from the document:

“Kubo can advertise to the Amino DHT in XOR order without loading large amounts of data into memory”

And here is where that new feature is discussed:

github.com/libp2p/go-libp2p-kad-dht

Reprovide Sweep

opened 01:06PM - 07 Mar 23 UTC

guillaumemichel

## Context Currently the Reprovide Operation is triggered by Kubo for each Pr…ovider Record. Kubo periodically (every [22h](https://github.com/ipfs/kubo/blob/8f638dcbcd875ecff92021e4b62d0af8848022ce/config/reprovider.go#L5)) republished Provider Records using `go-libp2p-kad-dht` Provide method. https://github.com/libp2p/go-libp2p-kad-dht/blob/b95bba8ddd70f68c6eb9df4faaff09695098870b/routing.go#L373 The DHT Provide method consists in performing a lookup request to find the 20 closest peers to the CID, open a connection to these peers and allocate them the Provider Record. ## The problem This means that for every Provider Record that a node is advertising, 1 lookup request needs to be performed every 22h, and 20 connections need to be opened. This may seem fine for small providers, however this is terrible for large Content Providers. The Reprovide operation is certainly the reason most large Content Provider don't use the DHT, and IPFS is forced to keep the infamous Bitswap broadcast. Improving the Reprovide operation would allow large Content Providers to advertise their content to the DHT. Once most of the content is published on the DHT, Bitswap broadcast can be significantly reduced. This is expected to significantly cut off the price of hosting content on IPFS, because all peers in the network won't get spammed with requests for CIDs they don't host. ## Solution overview By the pigeonhole principle, if a Content Provider is providing content for `x` CIDs, with $x \geq \frac{\\#DhtServers}{repl}$, then multiple Provider Records are allocated on the same DHT Servers. The optimization consists in reproviding all Provider Records allocated to the same DHT Servers at once. It spares expensive DHT lookups and connections opening. Without entering too much into details, all Provider Records are grouped by XOR proximity in the keyspace. All Provider Records in a group are allocated to the same set of DHT Servers. Perdiodically, the Content Provider _sweeps the keyspace from left to right_ and reprovides the Provider Records corresponding to the visited keyspace region. For a Content Provider providing 100K CIDs, and 25K DHT Servers the expected improvement is `~80x`. More details can be found on the WIP [Notion document](https://pl-strflt.notion.site/DHT-Reprovide-Sweep-3108adf04e9d4086bafb727b17ae033d). ## How to implement it The Reprovide operation responsibility should be transferred from Kubo to the DHT implementation. This is generally desired because different Content Routers may have different reprovide logic, that kubo is unaware of, or cannot optimize for. `go-libp2p-kad-dht` (and other Content Routers) should expose `StartProviding(CID)`, `StopProviding(CID)` methods instead of the `Provide(CID)` method. Kubo then only needs to pass to the DHT which CIDs should be provided or not. A lot of refactoring needs to happen around [go-libp2p-routing-helpers](https://github.com/libp2p/go-libp2p-routing-helpers). ## References * https://github.com/ipfs/go-ipfs-provider/issues/49 * https://github.com/libp2p/go-libp2p/issues/2175 * [Notion document](https://pl-strflt.notion.site/DHT-Reprovide-Sweep-3108adf04e9d4086bafb727b17ae033d)

I believe that’s what you are looking for.

markg85 · December 8, 2024, 3:31pm

Thank you for that link and highlighting the ...DHT in XOR...! Yes, that can work.

I’ll happily test it out once there’s a kubo release that does that. I’m cautiously worried though that the solution in practice only has a network benefit (More optimal DHT provides essentially). From a hardware resources point of view (cpu and memory) i have my worries that it might be even worse.

I’ll subscribe to that issue just to stay in the loop

hector · December 9, 2024, 8:58am

A pointer to the canonical discussion about block sizes: Supporting Large IPLD Blocks with people supporting different views:

Default is fine, problems come from inefficiencies in implementation
We should just 10x the limit while we figure something out
We should re-architect the system so that limits don’t play a role anymore

I think a lot of the focus goes indeed into making the browser a first-level citizen. There is value in being able to find providers and fetch content from any node via http not only for the browser though: additional language implementations got 100x easier by removing libp2p/yamux/streams from the equation. We hopefully see examples from the community.

masih · December 9, 2024, 10:15am

Kubo can provide to IPNI alongside the Amino DHT

I am excited to see this; question: does it mean the diff-style provide of the IPNI protocol or DHT style of TTL-based per CID provide specified in HTTP delegated routing specification?

I would also love to see the ambient indexer discovery in future milestones, with local-first ranking of IPNI indexer instances and transparent fallback onto alternative indexers in the IPNI federation.

bumblefudge · December 9, 2024, 10:50am

“Feasibility study on the tradeoffs around shifting traffic…to a service worker gateway”: Can we go one step further and scope some explicit performance and cost benchmarking in here? Like, not just “is it feasible to set up a routing delegator and pinnned-only provider for a project” and more “is it a 10 page tutorial or a 2 pager”, “what does a server that can stably/comfortable run it cost on hetzner”, those kind of things?
kubo providing subsystem upgrades – yaaaaay! music to my ears.
“Adding diagnostic screen to Web UI / IPFS Desktop” Yes! (sickos.gif)
“This likely requires partnership with a blockchain like Filecoin, Ethereum, Solana, Arweave, etc.” - This requires a bit of diplomacy, connections, and experience with the respective foundations/grant-funding channels of those ecosystems. Happy to help, CC me/include me wherever possible.
“Continued evolution of protocol specifications” - put me in there coach!.gif)

adin · December 9, 2024, 4:57pm

Yes. They should be runnable in web pages and service workers (e.g. https://inbrowser.link) as well as extensions. It’d be great if extensions supported custom protocol handlers pointing at service workers ServiceWorker-like protocol handlers for WebExtensions · Issue #212 · ipfs/in-web-browsers · GitHub rather than just to an HTTP endpoint, but that’s a longer term endeavor we’re working with browser vendors on.

adin · December 10, 2024, 3:31pm

Some great points and questions from folks. The forum prefers me to have a big post quoting everyone rather than lots of small ones, so here it goes .

They overlap some, but not entirely. Technically large raw blocks are valid UnixFS so just supporting large single-file blocks is a separate endeavor, which if that’s what people would prefer to support I’m certainly interested.

Regarding UnixFS alternatives the two biggest areas enabling UnixFS to have much wider support than non-UnixFS data are:

Support via gateway APIs / IPFS URI and public gateways like ipfs.io
More tooling for creating those DAGs from files (and generally lots of tooling out there around working with files vs other data representations)

So far it seems like 1 is a bigger issue than 2. People seem ok building their own hash-linked data structures for their applications (e.g. Filecoin, Bluesky, Ethereum, Solana Yellowstone, BitTorrent, …) however almost all of these tend not to interact with IPFS “mainnet” tooling. If those groups have an interest in bringing that data to mainnet then it’s something we can do, but if not then there are likely better things to work on.

A few related notes:

Many of these applications have blocks smaller than 2MiB and so interacting with the large blocks issue isn’t necessary. If the interested groups have large blocks then we can work on what that interoperability story looks like (e.g. in the large blocks discussion @hector linked).
For a lot of data types, particularly those that don’t require support for large blocks, the ability to do verifiable retrieval in the browser can empower these use cases at much lower cost than it used to.
To get from where we are today to supporting arbitrarily large blocks is a large endeavor that while I would LOVE to do it makes it IMO difficult to justify ahead of some of the other proposed items. However, a sufficiently compelling usecase / interoperability story (e.g. if the maintainers of tooling for pulling docker images / OCI containers, or any widely used package manager were interested in allowing us hooks for content addressable retrieval using their existing hashes, if iroh ↔ kubo interop was in high demand, etc.) would make doing this worthwhile, as would decreasing the amount of work required via other means.
- For example, the work around improving the downloading pipeline in boxo as well as equipping it with additional capabilities (e.g. HTTP-based downloading and webseeds) should drive down the cost of implementing safe large block retrievals such that it becomes easier to justify the work.

While I agree that kubo is still pulled in a lot of directions and it’d be great if there was time to diversify a bit to use libraries like boxo to build more differentiated and optimized applications (e.g. GitHub - ipfs/rainbow: A specialized IPFS HTTP gateway) I don’t think the content routing example is a good one.

Home users (e.g. @markg85 may be one of these) may also want to provide a bunch of data and require optimizations to the routing subsystem whether it’s enabling advertising less data successfully, being more efficient at DHT advertising, or leveraging IPNI for advertising. In some situations, such as a user with a lot of data and even reasonable upload bandwidth but they’re not able to open many simultaneous connections, using IPNI might even be best for them.

WebSeeds is likely to be significantly more work than setting up serving data over HTTP from a language without a ton of IPFS tooling, although I agree that it could be a great payoff.

As @hector noted, to some extent the latter is a litmus test on just how easy / difficult the HTTP tooling makes it for the wider community to make their own implementations. If all goes well (and the scope is limited to not include things like reimplementing UnixFS) this shouldn’t be too large an endeavor… if it is than I’d agree that it’d be worth re-evaluating if there’s a better place to invest time.

IMO there’s a lot to unpack in this sentence (and might be enough to spawn a new thread), but:

Certainly agree that Shipyard should be involved in driving conversations regarding IPFS’s evolution
A number of Shipyard folks were at the CID Congress meeting where dazl was discussed
At the moment there don’t seem to be many other folks focused on the health and growth of IPFS mainnet, which is in practice what most folks think of as IPFS and use today. I think in practice for our IPFS community work the highest priority “thought leadership” should be around IPFS mainnet followed by helping enable (self-)verifiable data that would be valuable if it made it onto mainnet. There are lots of other interesting conversations to be had around content addressable and verifiable data at large, but spending more time there means spending less time on topics related to mainnet which seems like a mistake.

Not that Fleek is the be-all and end-all here, but they’re certainly trying to do this and in use by a number of folks. As we work with dApps to leverage the verified fetch and service-worker gateway code paths I think we’ll learn more about the market’s needs are here.

I suspect internally the APIs for a given content routing system should look like Start/Stop providing and then leverage whatever makes sense per system. For the Amino DHT it will still do TTL-based reprovides per CID, for IPNI likely diff-based unless it’s easier to do TTL-based.

Note: TTL-based is likely still useful either way as a mechanism for making it really easy to build an IPFS mainnet compatible implementation without needing all the mechanics around generating and syncing the IPNI DAG.

+a lot. I think there are a number of avenues it’d be great to improve upon IPNI, let’s see how long it takes to get the other content routing work done first .

IIUC you are asking for something completely different. This feasibility study is roughly: There are nearly a billion requests to ipfs.io and dweb.link every day that are paid for by the IPFS Foundation (which is funding that could go towards things like making the implementations and protocols better and easier to use in furtherance of the project’s mission) and which are a source of centralization in the practical use of the network. What will the net experience shift be if we push browsers towards directly downloading the data from the nodes hosting it and doing the validation themselves?

This is a totally different study from one on the difficulty and cost around self-hosting your own data and relates to Mark’s point.

There’s a lot to unpack here and cross dependencies that I’m happy to discuss but likely are too much for here (my comment is enormous already), but TLDR:

HTTP tooling should help here in terms of both enabling new specifically optimized implementations and making it easier to use existing HTTP tooling to control rate-limits, etc.
Provider system reworking should ease the workload here
Data transfer work in boxo that’s related to (but not 100% the same as) the HTTP work should be able to reduce a lot of unnecessary workload.
- As an anecdote we run rainbow rather than kubo to back ipfs.io and dweb.link and the resource usage is significantly better. Some of this is due to some hacks that were implemented in rainbow that would not be reasonable for kubo but that ultimately work around some issues in boxo in long need of resolution

@bumblefudge as this work completes I think asking some folks to tell us how easy / difficult it is to self-host data would be super useful.

danieln · December 12, 2024, 10:31am

There’s already a lot here and both @hector and @adin shared some good insights. I’d like to add a couple of points.

Mainnet as a shared participatory global network is one of the most useful things about IPFS. Alas, there are reasons why Mainnet has problems, but recent advancements have alleviated many of those. I’d say we’re about 70% of the way there. Granted, there are inherent trade-offs with the Mainnet approach, like increased latency and the overhead of relying on a “forgetful” DHT. But the global namespace/singleton nature of IPFS Mainnet makes it unique and unlike any other protocol/network out there.
We already support a lot of “non-UnixFS” use-cases in the implementations/tooling Shipyard maintains. Albeit without support for large blocks and incremental verification. I’ll leave this for the moment, but I agree that this is an important project. We should tackle this collaboratively with a real-world use-case .
There’s more work to improve developer experience around these use-cases. But in my view, such improvements should find the sweet spot between helping adopt CIDs in your protocol/app as well as on-ramps to IPFS Mainnet. For example, if an application developer chooses to adopt DASL CIDs, it should be easy to also make that data available through IPFS Mainnet (by either users or the app builder).
CIDs become especially useful if you can retrieve data without special knowledge about how your app leverages them. For this reason integrations with Mainnet are where a lot of the potential lies in my opinion.
I see a huge opportunity is combining WebSeeds (by which I mean http gateway endpoints with their data announced to either DHT/IPNI/app-specific delegated routing endpoint) with some of the new emerging use-cases like AT Protocol/Bluesky so as to make it easier to make data available on Mainnet.

What about UnixFS:

UnixFS is mostly useful for representing files and directories.
If you are just working with files (no directories), perhaps you can forgo UnixFS and just use hashes with raw data like AT Protocol does for blobs. It may take some work to for interop with Mainnet (incremental verification and all other aspects discussed in Supporting Large IPLD Blocks)
There are still challenges and drawbacks with UnixFS (some came up in CID Congress) that we need to address:
same data results in different CIDs, aka hash equivalency across different systems or CID determinism. (for all the reasons mentioned in Should we profile CIDs?)
Much more pre-processing (chunking) is necessary to get CIDs than just generating a raw hash.
Some find the dependency on protobufs (especially if they already depend on cbor) undesirable, particularly in web environments.
Once data is merkelised as UnixFS it often needs to be stored twice (this is an implementation detail, not a hard limit).

Many of these problems are soluble and the proposed work plans address many of these while being realistic in scope.

Would love to hear any feedback on this.

Topic		Replies	Views
What motivates people to use IPFS for large volumes of data? Ecosystem and Usage	32	8897	October 26, 2018
Native IPFS support in web browsers Ecosystem and Usage	9	4365	May 2, 2018
How can IPFS distribute dynamic content (private, server side, user-specific content like passwords)? (WIP) Help	13	4674	May 23, 2017
IPFS for community-led research Help research	6	2105	May 17, 2017
Dynamic website with user functionality Help	25	9341	April 9, 2018

Work-plans for kubo, helia, & other Shipyard IPFS projects in 2025

Related topics