Cross-compatibility of IPFS and BitTorrent with WebSeeding

NCGThompson · September 11, 2022, 10:15pm

tl;dr: Get BitTorrent clients to download and seed over both BT and IPFS, while reducing redundancy. We can use the same techniques to deduplicate or revive dead CIDs in general. Feedback appreciated.

For background, GetRight style WebSeed links (BEP 19) are just regular web links that a BitTorrent client can download from:

Many websites that list a BitTorrent download also provide a HTTP or FTP URL for the same file. The files are identical. A WebSeeding BitTorrent client can download from either source, putting all the parts together into one complete file. … Clients that do not support HTTP/FTP Seeding would still get the benefit from peers sharing pieces originally from an HTTP/FTP server.

As you can see it is a pretty effective deduplication technique. The data can be downloaded via one protocol then be deterministically seeded via another. The standard notes that the links don’t all have to be http or ftp, and any protocol (such as ipfs) can be placed before the ://. If the client doesn’t recognize the protocol it will just move on to the next link.
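To make the "move on to the next link" behaviour concrete, here is a minimal sketch of how a client might filter the WebSeed list (the `url-list` key from BEP 19). The function name and the set of supported schemes are my own assumptions, not taken from any existing client.

```python
from urllib.parse import urlsplit

# Schemes this hypothetical client knows how to fetch from (an assumption).
SUPPORTED_SCHEMES = {"http", "https", "ftp"}


def usable_webseeds(url_list, supported=SUPPORTED_SCHEMES):
    """Return the WebSeed URLs whose scheme the client understands, in order."""
    # Some torrents store a single string instead of a list, so accept both.
    if isinstance(url_list, str):
        url_list = [url_list]
    # Links with an unrecognized scheme are skipped rather than treated as errors.
    return [u for u in url_list if urlsplit(u).scheme.lower() in supported]
```

A client that gains IPFS support would only need to add `"ipfs"` to the supported set; older clients keep working because they ignore the link.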

Let’s say you want to publish a large file via IPFS to many people. You know that many of the users don’t have an IPFS client, and all of them downloading it over a public gateway could put a lot of strain on the servers. To reduce the load on the server, you create a torrent that lists the gateway link as a WebSeed, then publish its magnet link side by side with the gateway link.
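As a rough illustration, a magnet link that carries a gateway URL could be assembled like this. I am assuming the commonly used `ws=` (web seed) magnet parameter here; the infohash still has to come from an actual .torrent built over the same file, and the function name and example values are hypothetical.

```python
from urllib.parse import quote


def build_magnet(infohash_hex, name, webseeds):
    """Build a magnet URI with a display name and one ws= entry per WebSeed."""
    parts = [
        "xt=urn:btih:" + infohash_hex.lower(),  # v1 infohash (SHA-1, hex)
        "dn=" + quote(name, safe=""),           # display name, percent-encoded
    ]
    # Each WebSeed URL is percent-encoded so it survives inside the query string.
    parts += ["ws=" + quote(url, safe="") for url in webseeds]
    return "magnet:?" + "&".join(parts)
```

Clients that do not understand `ws=` should simply fall back to finding peers through the DHT or trackers.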

While I don’t think any BT clients support this now, they could have an extension that diverts ipfs links to the local gateway, just like the official browser extension does. I think we can make the client even smarter, however.
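The diversion itself is a small URL rewrite. The sketch below assumes a gateway listening on `127.0.0.1:8080` (Kubo's default gateway address, as far as I know); the function name is made up for illustration.

```python
from urllib.parse import urlsplit

# Assumed address of a local gateway; adjust to match the node's configuration.
LOCAL_GATEWAY = "http://127.0.0.1:8080"


def to_gateway_url(url, gateway=LOCAL_GATEWAY):
    """Rewrite ipfs:// and ipns:// links to path-style gateway URLs."""
    parts = urlsplit(url)
    if parts.scheme not in ("ipfs", "ipns"):
        return url  # leave ordinary http/ftp WebSeeds untouched
    # netloc holds the CID (or IPNS name); its case is preserved by urlsplit.
    rewritten = f"{gateway}/{parts.scheme}/{parts.netloc}{parts.path}"
    if parts.query:
        rewritten += "?" + parts.query
    return rewritten
```

After the rewrite the link is a plain HTTP WebSeed, so the rest of the client needs no changes.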

If a WebSeed link in the .torrent file is to a specific CID (as opposed to DNSLink for example), then it can be taken as a hint that the CID is “canonical” in the .torrent creator’s opinion, and that there are other references to the CID (or another CID that shares files). Therefore, it could be useful to pin the file(s) with IPFS as well as seed them with BitTorrent. The question is, how do we do this while avoiding redundancies?

In terms of storing the file, deduplication can be done in Kubo using the filestore and read-only mounting. Many BT clients have analogous features as well.

Eliminating redundant downloads is a bit more complicated. By the definition of BEP 19, we can use the .torrent to convert any IPFS data we have to BT data, and seed it. Because of that, it would be trivial to pin and download all of the IPFS data, then seed it as BT. However, we also want to be able to download over BT and then pin as IPFS. The simplest solution I can think of is downloading all the IPFS metadata except the leaf nodes, then using the metadata to recreate the leaf nodes from the BT data.

When a BT client begins downloading a torrent, it should first start the BT protocol as normal. Asynchronously, it should read the WebSeed links to see if there are any IPFS CIDs. Then the client should instruct the gateway to pin and download the directory paths and metadata of each file. If the metadata contradicts the .torrent, then the path should be unpinned and ignored by the client. If the torrent contains multiple files, then per BEP 19 the linked CID should be the root directory (or contain it; see specs for details; I’m not sure about the details myself) of the torrent and contain all the file paths listed in the .torrent. The spec does not require that the linked root directory contain only the file paths listed, so extremely large shared directories need to be handled appropriately.
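A sketch of the consistency check described above, using plain Python data instead of a real IPFS API. How the directory listing is fetched, and how paths are made relative to the linked root, are left out on purpose; both function names are hypothetical.

```python
from urllib.parse import urlsplit


def ipfs_roots(url_list):
    """Return the CID part of every ipfs:// WebSeed link, in order."""
    return [urlsplit(u).netloc for u in url_list if urlsplit(u).scheme == "ipfs"]


def listing_matches_torrent(torrent_files, ipfs_files):
    """Check IPFS directory metadata against the .torrent file list.

    torrent_files: list of (path, length) pairs taken from the .torrent.
    ipfs_files: dict mapping path -> size, taken from the IPFS metadata.
    Extra entries on the IPFS side are allowed, since the linked directory
    may contain more than the torrent lists.
    """
    return all(ipfs_files.get(path) == length for path, length in torrent_files)
```

If the check fails, the client would unpin the path and carry on with BT only; the piece hashes still decide what data is accepted.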

Next, the client should download all blocks that are necessary to reconstruct the file in IPFS (if it has already downloaded the file with BT, it may first try guessing how to reconstruct it), then evaluate peer availability on each protocol to decide which parts of the files to download with which protocol. The client should try adding pieces downloaded with BT to IPFS to see whether they produce the correct hashes. If the client is unable to reproduce the CID from BT data, it will redundantly download the same data over IPFS.
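To recreate a leaf node from BT data, the client has to know which pieces cover it. Below is a minimal sketch for a single-file torrent, assuming the file was added to IPFS with fixed-size chunking (256 KiB is Kubo's default chunk size, as far as I know). Other chunkers, and multi-file torrents where files share pieces, would need more bookkeeping.

```python
def leaf_blocks(file_size, chunk_size=256 * 1024):
    """Yield (offset, length) for each leaf block under fixed-size chunking."""
    for offset in range(0, file_size, chunk_size):
        yield offset, min(chunk_size, file_size - offset)


def pieces_covering(offset, length, piece_length):
    """Return the indices of the BT pieces overlapping [offset, offset + length).

    Assumes length > 0 and that offsets are relative to the start of the
    torrent payload (true for a single-file torrent).
    """
    first = offset // piece_length
    last = (offset + length - 1) // piece_length
    return range(first, last + 1)
```

Once every piece in the returned range has been verified, the bytes for that leaf can be sliced out and hashed to see whether they reproduce the expected block.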

Finally, the hashes in the .torrent are the highest authority. All data should be checked against those hashes and rejected if it doesn’t match. If an IPFS file does not match the .torrent hashes, then it should be unpinned.
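For a v1 torrent this check is the usual per-piece SHA-1 comparison, regardless of which protocol delivered the bytes. A minimal sketch, where `pieces` stands for the raw `pieces` string from the info dictionary (concatenated 20-byte digests) and the function name is my own:

```python
import hashlib

SHA1_LEN = 20  # each entry in the v1 "pieces" string is a 20-byte SHA-1 digest


def piece_is_valid(data, index, pieces):
    """Return True if data hashes to the expected digest for piece `index`."""
    expected = pieces[index * SHA1_LEN:(index + 1) * SHA1_LEN]
    # An out-of-range index yields a short slice, which is treated as invalid.
    return len(expected) == SHA1_LEN and hashlib.sha1(data).digest() == expected
```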

I don’t think Kubo comes with the features necessary to pull this off efficiently. However, generic CID-recreating software should be created anyway and used as a dependency.

Does anyone have any ideas, or know of anything like what I’m describing?

adin · September 13, 2022, 4:26am

IMO this is very doable, whether Kubo can do it efficiently or it’d be more efficient to leverage existing IPFS library code (e.g. the Go code that goes into Kubo or some of the alternatives in JS or Rust). There are already a number of IPFS implementations out there, so making a new one out of similar parts doesn’t seem like a stretch.

As you mentioned, the strawman version of this is pretty easy already: you can run a Kubo node, use a local (or public) HTTP gateway, and call it a day. The complexity comes with making sure your bytes on disk are deduplicated and with the UX for synchronizing the “keep my stuff” concept between the IPFS and BitTorrent implementations.

One way to get started here if you wanted to handle serving data over common IPFS protocols (e.g. Bitswap) as well might be to fork/leverage one of the multiple existing BitTorrent implementations and integrate the IPFS pieces into it. I’m pretty sure there are a number of Go-based BitTorrent clients that would make plugging in some of the Go IPFS tooling not too rough. There’s a bunch of funding and grants going into IPFS implementations these days too if that’s something you’re interested in.

As an aside you might be interested to know that there’s another way that IPFS compatibility with BitTorrent might be achieved. Namely, you could take the merkle-tree used for BitTorrent data and work with that data via IPLD rather than strictly the UnixFS file format most people first encounter when working with IPFS.

This would mean that instead of requiring an ipfs://<some-cid> in the webseed, implementations downloading torrents (even without webseeds) could check if <bittorrent-infohash> was available using IPFS tooling (e.g. someone advertised it in the IPFS Public DHT and made it available over Bitswap).
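For illustration only, here is one plausible way a v1 infohash could be wrapped as a CID so that IPFS tooling can look it up. It assumes the multicodec table's `torrent-info` code (0x7b) and the SHA-1 multihash code (0x11); whether a given IPLD BitTorrent implementation uses exactly this encoding is something to verify against that implementation.

```python
import base64

CID_VERSION = 0x01
TORRENT_INFO_CODEC = 0x7B  # multicodec "torrent-info" (assumed to be the right codec)
SHA1_MULTIHASH = 0x11      # multihash code for SHA-1


def infohash_to_cid(infohash_hex):
    """Wrap a 20-byte v1 infohash as a base32 CIDv1 string."""
    digest = bytes.fromhex(infohash_hex)
    if len(digest) != 20:
        raise ValueError("a v1 infohash must be 20 bytes")
    # All four header values are below 0x80, so each varint is a single byte.
    raw = bytes([CID_VERSION, TORRENT_INFO_CODEC, SHA1_MULTIHASH, len(digest)]) + digest
    # Multibase prefix "b" means lowercase base32 without padding.
    return "b" + base64.b32encode(raw).decode("ascii").lower().rstrip("=")
```

The appeal of this direction is that no new identifier has to be published: anyone holding the infohash can derive the CID locally.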

For example, if you look at the bottom of my post in IPLD and IPFS - A Pitch for the Future ⚾ you can see an example of loading a file from inside of a BitTorrent directory referenced by the BitTorrent infohash. It’s still early work, but if you’re interested in pushing in that direction there are likely a bunch of people (including myself) who can give you some pointers on where to go from here.
