tl;dr: Get BitTorrent clients to download and seed over both BT and IPFS, while reducing redundancy. We can use the same techniques to deduplicate or revive dead CIDs in general. Feedback appreciated.
For background, GetRight-style WebSeed links (BEP 19) are just regular web links that a BitTorrent client can download from:
Many websites that list a BitTorrent download also provide a HTTP or FTP URL for the same file. The files are identical. A WebSeeding BitTorrent client can download from either source, putting all the parts together into one complete file. … Clients that do not support HTTP/FTP Seeding would still get the benefit from peers sharing pieces originally from an HTTP/FTP server.
As you can see, it is a pretty effective deduplication technique. The data can be downloaded via one protocol and then deterministically seeded via another. The standard notes that the links don’t all have to be `http` or `ftp`, and any protocol (such as `ipfs`) can be placed before the `://`. If the client doesn’t recognize the protocol, it will just move on to the next link.
Let’s say you want to publish a large file via IPFS to many people. You know that many of the users don’t have an IPFS client, and all of them downloading it over a public gateway could put a lot of strain on the servers. To reduce the load on the server, you create a magnet link from the gateway link and publish them side by side.
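As a rough illustration of publishing the two side by side, here is a minimal sketch that builds a magnet link carrying the gateway URL as a WebSeed. It assumes the client honors the `ws` magnet parameter, which is a common convention for web seeds but is not something BEP 19 itself defines. The info-hash, CID, file name, and gateway host are placeholders, and `make_magnet` is a hypothetical helper.

```python
from urllib.parse import parse_qs, quote


def make_magnet(info_hash_hex, name, web_seeds):
    """Build a magnet link with one ws= parameter per WebSeed URL."""
    parts = [f"xt=urn:btih:{info_hash_hex}", f"dn={quote(name, safe='')}"]
    # Percent-encode each URL in full so its own '?' and '&' cannot leak
    # into the magnet link's query string.
    parts += [f"ws={quote(url, safe='')}" for url in web_seeds]
    return "magnet:?" + "&".join(parts)


# Placeholder values for illustration only; none of these are real.
EXAMPLE_HASH = "0123456789abcdef0123456789abcdef01234567"
EXAMPLE_CID = "bafy-example-cid"

link = make_magnet(
    EXAMPLE_HASH,
    "big-file.iso",
    [f"https://gateway.example/ipfs/{EXAMPLE_CID}", f"ipfs://{EXAMPLE_CID}"],
)

# Round trip: a client would recover the WebSeed URLs like this.
parsed = parse_qs(link.split("?", 1)[1])
print(parsed["ws"])
```

Listing both the gateway URL and the bare `ipfs://` URL lets clients without IPFS support fall back to the gateway, while an IPFS-aware client could prefer the second entry.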
While I don’t think any BT clients support this now, they could have an extension that diverts ipfs links to the local gateway, just like the official browser extension does. I think we can make the client even smarter, however.
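To make the diversion idea concrete, here is a minimal sketch of how a client extension might treat the WebSeed list: `ipfs://` links are rewritten to the local gateway, known schemes pass through, and anything else is skipped. The gateway address assumes Kubo’s default of `127.0.0.1:8080`; `resolve_web_seeds` and the example URLs are hypothetical.

```python
from urllib.parse import urlsplit

# Assumption: a local Kubo node with its gateway on the default address.
LOCAL_GATEWAY = "http://127.0.0.1:8080"
SUPPORTED_SCHEMES = {"http", "https", "ftp"}


def resolve_web_seeds(url_list):
    """Return fetchable URLs, diverting ipfs:// links to the local gateway."""
    usable = []
    for url in url_list:
        parts = urlsplit(url)
        scheme = parts.scheme.lower()
        if scheme == "ipfs":
            # ipfs://<cid>/<path> -> <gateway>/ipfs/<cid>/<path>
            # netloc is used (not hostname) because it preserves case,
            # and base58 CIDv0 strings are case-sensitive.
            usable.append(f"{LOCAL_GATEWAY}/ipfs/{parts.netloc}{parts.path}")
        elif scheme in SUPPORTED_SCHEMES:
            usable.append(url)
        # Any other scheme is unrecognized: move on to the next link.
    return usable


print(resolve_web_seeds([
    "ipfs://bafy-example-cid/dir/a.bin",
    "gopher://example.org/a.bin",
    "https://example.org/a.bin",
]))
```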
If a WebSeed link in the .torrent file points to a specific CID (as opposed to a DNSLink, for example), then it can be taken as a hint that the CID is “canonical” in the .torrent creator’s opinion, and that there are other references to the CID (or to another CID that shares files). Therefore, it could be useful to pin the file(s) with IPFS as well as seed them with BitTorrent. The question is, how do we do this while avoiding redundancies?
In terms of storing the file, deduplication can be done in Kubo with the filestore and read-only mounting. Many BT clients have analogous features as well.
Eliminating redundant downloads is a bit more complicated. By the definition of BEP 19, we can use the .torrent to convert any IPFS data we have to BT data, and seed it. Because of that, it would be trivial to pin and download all of the IPFS data, then seed it as BT. However, we want to be able to download over BT and then pin as IPFS as well. The simplest solution I can think of is to download all the IPFS metadata except the leaf nodes, then use the metadata to recreate the leaf nodes from the BT data.
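Here is a minimal sketch of what recreating leaf nodes from BT data could look like in the simplest case. It assumes the file was added to IPFS with raw leaves, CIDv1, sha2-256, and a fixed-size 256 KiB chunker. If it was added with other settings (dag-pb-wrapped leaves or a different chunker, for example), these CIDs will not match and the client would have to try other guesses. The function names are hypothetical.

```python
import base64
import hashlib

CHUNK_SIZE = 256 * 1024  # assumption: fixed-size chunker with 256 KiB chunks


def raw_leaf_cid(block):
    """CIDv1 (raw codec, sha2-256) of one block, as a base32 string."""
    digest = hashlib.sha256(block).digest()
    # 0x01 = CIDv1, 0x55 = raw codec, 0x12 = sha2-256, 0x20 = digest length
    cid_bytes = bytes([0x01, 0x55, 0x12, 0x20]) + digest
    encoded = base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")
    return "b" + encoded  # 'b' is the multibase prefix for lowercase base32


def leaf_cids(data, chunk_size=CHUNK_SIZE):
    """Leaf CIDs for data already downloaded over BT, in file order."""
    return [
        raw_leaf_cid(data[i:i + chunk_size])
        for i in range(0, len(data), chunk_size)
    ]


def unmatched_leaves(data, expected_cids):
    """Indices where the recreated leaf CID differs from the IPFS metadata."""
    ours = leaf_cids(data)
    if len(ours) != len(expected_cids):
        return list(range(len(expected_cids)))  # layout guess was wrong
    return [i for i, (a, b) in enumerate(zip(ours, expected_cids)) if a != b]
```

If `unmatched_leaves` returns an empty list, every leaf can be served to IPFS straight from the BT download; any index it does return is a block the client would still need to fetch over IPFS.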
When a BT client begins downloading a torrent, it should first start the BT protocol as normal. Asynchronously, it should read the WebSeed links to see if there are any IPFS CIDs. Then the client should instruct the gateway to pin and download the directory paths and metadata of each file. If the metadata contradicts the .torrent, then the path should be unpinned and ignored by the client. If the torrent contains multiple files, then per BEP 19 the linked CID should be the root directory of the torrent (or contain it; see the spec for details; I’m not sure about the details myself) and contain all the file paths listed in the .torrent. BEP 19 does not specify that the linked root directory contain only the listed file paths, so extremely large shared directories need to be handled appropriately.
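The “metadata contradicts the .torrent” check could be sketched roughly as follows. It assumes the client has already decoded the .torrent’s `files` list into dictionaries with string paths, and has built a path-to-size mapping from the IPFS directory metadata. Both data shapes and the function name are hypothetical. Extra entries on the IPFS side are tolerated, since the linked directory may contain more than the torrent lists.

```python
def metadata_conflicts(torrent_files, ipfs_listing):
    """Return torrent paths that are missing from, or differ in size in, IPFS.

    torrent_files: list of {"path": [components...], "length": int}
    ipfs_listing:  dict mapping "dir/file" -> size in bytes
    """
    conflicts = []
    for entry in torrent_files:
        path = "/".join(entry["path"])
        # A missing path yields None, which never equals a length.
        if ipfs_listing.get(path) != entry["length"]:
            conflicts.append(path)
    return conflicts


torrent_files = [
    {"path": ["docs", "readme.txt"], "length": 120},
    {"path": ["data.bin"], "length": 4096},
]
ipfs_listing = {
    "docs/readme.txt": 120,
    "data.bin": 4000,         # size mismatch: contradicts the .torrent
    "unrelated/extra.iso": 7,  # not in the torrent: ignored
}
print(metadata_conflicts(torrent_files, ipfs_listing))
```

A non-empty result would be the signal to unpin the path and fall back to plain BT for it.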
Next, the client should download all blocks needed to reconstruct the file in IPFS (if it has already downloaded the file with BT, it may first try guessing how to reconstruct it), and then evaluate peer availability on each protocol to decide which parts of the files to download over which protocol. The client should try adding pieces downloaded with BT to IPFS to check that they produce the correct hashes. If the client is unable to reproduce the CID from BT data, it will redundantly download the same data over IPFS.
Finally, the hashes in the .torrent are the highest authority. All data should be checked against those hashes and rejected if it doesn’t match. If an IPFS file does not match the .torrent hashes, then it should be unpinned.
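For data fetched over IPFS, that check could be a minimal sketch like the one below. It assumes a v1 torrent, where the `pieces` field is a concatenation of 20-byte SHA-1 digests, and a payload already laid out as one contiguous byte string (multi-file torrents hash the files concatenated in order). `bad_pieces` is a hypothetical helper.

```python
import hashlib

SHA1_LEN = 20  # each entry in a v1 torrent's `pieces` field is 20 bytes


def bad_pieces(data, pieces, piece_length):
    """Indices of pieces whose SHA-1 does not match the .torrent."""
    if len(pieces) % SHA1_LEN != 0:
        raise ValueError("pieces field is not a multiple of 20 bytes")
    expected = [
        pieces[i:i + SHA1_LEN] for i in range(0, len(pieces), SHA1_LEN)
    ]
    bad = []
    for index, want in enumerate(expected):
        # The final piece may be shorter than piece_length.
        chunk = data[index * piece_length:(index + 1) * piece_length]
        if hashlib.sha1(chunk).digest() != want:
            bad.append(index)
    return bad
```

Any index returned here marks data that should be rejected, and the IPFS file it came from becomes a candidate for unpinning.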
I don’t think Kubo comes with the features necessary to pull this off efficiently. However, generic CID-recreating software should be created anyway and used as a dependency.
Does anyone have any ideas, or know of anything like what I’m describing?