Last week at the Costa Rican hackerspace we started a project to serve the Ubuntu archive on top of IPFS. The idea is that downloading deb packages from a nearby IPFS node should be faster than downloading them from the closest mirror of the archive.
For example, in Costa Rica there is only one mirror, hosted at the university in the middle of the country. In the jaquerespeis there are at least 50 Ubuntu users spread across different geographic areas. If we get all of them to use the IPFS mirror, we should start seeing faster downloads. Or at least, that’s the theory; the experiment is to see how it goes in real life.
We started writing a transport for IPFS that knows how to download IPFS URIs:
Now we are trying to find the resources to get a server and start seeding the full archive. This requires 2TB of storage, so it might take us some time to find somebody willing to donate the server.
We also need to put that transport in a PPA to make it easier to install, and run more tests with more people.
So far, the only problem we have is that downloading stuff from the published IPNS is slower than using the hash directly. I don’t yet know if it will be slower than hitting the HTTP mirror. We will need to make measurements and more controlled experiments.
If you want to join the project, any kind of help will be appreciated, especially from Ubuntu users who would like to try it and share their bandwidth serving the debs on IPFS.
This should also work for Debian, but that’s another 2TB for the Debian archive. So maybe we should wait until the first mirror is online before trying that.
So far, the only problem we have is that downloading stuff from the published IPNS is slower than using the hash directly.
IPNS is currently known to be rather slow. It has to do a DHT lookup and our DHT isn’t the fastest in the world (and DHTs tend to be slow in general). We’re working on making it better but that will take some time.
For now, you might want to consider using HTTPS to get the repo’s root hash and then fetch the actual data over IPFS.
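As a rough illustration of that suggestion, here is a minimal sketch. It assumes the mirror publishes its current root hash as a small plain-text file over HTTPS; the file format is an assumption made only for this example, and both function names are hypothetical. The client reads the hash once over HTTPS and then builds immutable /ipfs/ paths from it, so no IPNS lookup is involved.

```python
from urllib.request import urlopen


def fetch_root_hash(url: str, timeout: float = 10.0) -> str:
    """Download a plain-text file that contains only the mirror's root hash.

    Assumption: the mirror operator serves such a file over HTTPS.
    """
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("ascii").strip()


def ipfs_path(root_hash: str, relative_path: str) -> str:
    """Build the immutable IPFS path of a file inside the mirror."""
    return f"/ipfs/{root_hash}/{relative_path.lstrip('/')}"
```

With a root hash in hand, `ipfs_path(root_hash, "dists/xenial/Release")` gives the path to fetch over IPFS.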
You can also use something called dnslink to make /ipns/domain.name/ point to an IPFS hash but that’s only as secure as DNS is (although apt repos are signed). To use dnslink, add a TXT record in the form of dnslink=/ipfs/HASH to _dnslink.domain.name. /ipns/domain.name will now resolve to /ipfs/HASH using DNS.
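The record format described above can be sketched in a few lines. The function names are hypothetical and used only for illustration; the code does no DNS queries, it only builds and parses the TXT value.

```python
from typing import Tuple


def dnslink_record(domain: str, ipfs_hash: str) -> Tuple[str, str]:
    """Return the (record name, TXT value) pair for a dnslink entry."""
    return f"_dnslink.{domain}", f"dnslink=/ipfs/{ipfs_hash}"


def resolve_dnslink(txt_value: str) -> str:
    """Extract the IPFS path from a dnslink TXT value."""
    prefix = "dnslink="
    if not txt_value.startswith(prefix):
        raise ValueError("not a dnslink TXT record: " + txt_value)
    return txt_value[len(prefix):]
```

For a placeholder domain such as `mirror.example.org`, the record name would be `_dnslink.mirror.example.org`, and resolving its value yields the `/ipfs/HASH` path that `/ipns/mirror.example.org` points to.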
I understand your HTTP suggestion now. That would require a new apt transport, and the mirror would also need to run an HTTP server in addition to IPFS. Both things are simple, so we can implement them if the first measurements with IPNS are bad.
I will investigate bitswap more, because once we have the first mirror rsync’ed, it sounds awesome to sync the rest through IPFS.
@koalalorenzo @leerspace the Ubuntu archive is currently 1.1TB. I’m budgeting 2TB to leave room for growth when the new LTS release comes out in April. Other than that, you need to have ipfs installed and give it part of your bandwidth, ideally 24/7.
That is for a full mirror. I’m thinking that before we start making measurements in real life, it would be nice to have at least 3 full stable mirrors. But even smaller servers that host only some of the directories will be very useful once we set up our first full mirror.
Here is more information about how to sync a local mirror: https://wiki.ubuntu.com/Mirrors/Scripts
I have one in progress at my house, but with my slow Costa Rican connection it will take at least a month to finish. Any help to speed that up would be amazing.
If someone posts an ipns address or ipfs hash here for the full archive (in IPFS) I can pin it to at least one node in the midwest US. It’s admittedly not very close to Costa Rica, but it should hopefully be better than nothing.
@elopio: How do we manage deduplication? Do the packages always keep the same hash, or do we get a new hash each time we mirror? I will try it out as soon as I am ready, but I want to be sure that it works correctly, so that multiple repositories can share the same packages (as they have the same hashes).
Every time we rsync the mirror, some of the files will change, so the hash for the directory will change. The current solution is to point the node’s IPNS name at the hash of that directory. Like this, but with a full mirror, not just one package:
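A minimal sketch of those two steps, assuming the go-ipfs command line is installed: `ipfs add -r -Q` adds a directory recursively and prints only the root hash, and `ipfs name publish` points the node’s IPNS name at it. The function names and the `/srv/ubuntu` path are placeholders for illustration, not part of the project.

```python
import subprocess
from typing import List


def add_command(mirror_dir: str) -> List[str]:
    """Command that adds the directory recursively and prints only the root hash."""
    return ["ipfs", "add", "-r", "-Q", mirror_dir]


def publish_command(root_hash: str) -> List[str]:
    """Command that points this node's IPNS name at the given root hash."""
    return ["ipfs", "name", "publish", f"/ipfs/{root_hash}"]


def publish_mirror(mirror_dir: str) -> str:
    """Add the mirror to IPFS, publish its root hash under IPNS, and return the hash."""
    result = subprocess.run(
        add_command(mirror_dir), check=True, capture_output=True, text=True
    )
    root_hash = result.stdout.strip()
    subprocess.run(publish_command(root_hash), check=True)
    return root_hash
```

Running `publish_mirror("/srv/ubuntu")` after each rsync would republish the new directory hash under the same IPNS name.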
The files inside the mirror are resolved by path relative to the hash of the directory. If the rsync didn’t change them, their hash will remain the same.
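To make the deduplication point concrete, here is a toy model of content addressing. It uses plain SHA-256 over file contents and over a sorted directory listing, which is a simplification and not how IPFS actually computes its hashes; the file names and contents are made up.

```python
import hashlib
from typing import Dict


def file_hash(content: bytes) -> str:
    """Hash of a file depends only on its content."""
    return hashlib.sha256(content).hexdigest()


def directory_hash(files: Dict[str, bytes]) -> str:
    """Hash of a directory depends on the names and hashes of its entries."""
    listing = "\n".join(
        f"{name} {file_hash(data)}" for name, data in sorted(files.items())
    )
    return hashlib.sha256(listing.encode("utf-8")).hexdigest()


# Two snapshots of a tiny mirror: only the Release file changed between syncs.
before = {"pool/a.deb": b"package a v1", "dists/Release": b"release 1"}
after = {"pool/a.deb": b"package a v1", "dists/Release": b"release 2"}

unchanged_file = file_hash(before["pool/a.deb"]) == file_hash(after["pool/a.deb"])
changed_root = directory_hash(before) != directory_hash(after)
```

In this model the root hash changes after the sync while the untouched package keeps its hash, which is why nodes holding that package would not need to download it again.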
As I mentioned before, resolving the IPNS name is currently slow. I don’t yet know if this delay is enough to make it painful and worse than HTTP on average, so I think we should explore this option first because it’s the easiest and most transparent. Maybe we can even give the devs a hand to make IPNS faster.
After we test and measure this, we can explore other solutions, like an ipfs client that works like an http cache, a central index of deb names->ipfs hashes, or saving the mirror hash in a faster protocol than ipns as @stebalien suggested. I think it will be useful for future projects to have tools and experiences on all of these solutions.
But feel free to disagree; we are open to trying the possible solutions in a different order. And maybe there are other solutions that we haven’t thought of yet.
We had a little chat on the #ipfs-cluster IRC channel. I’m not yet sure what the best role for the cluster is here. They told me there are a few features on the backlog that could help spread the load while bootstrapping a mirror. Also, once we have a full mirror, we could turn it into a cluster master so the rest can sign up as cluster slaves instead of being independent servers. I’m very happy to play with these options too, but again, it seems to me that the first step is to have one full mirror.