For anyone wondering, the security problem is that bitswap needs to download entire blocks (in this case, git objects) before it can verify their hashes. Unfortunately, git doesn’t break large files into multiple smaller objects. One of the goals here is to be compatible with git hash-for-hash (commit-id for commit-id) so we can’t just change the underlying git object format to keep them small.
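To make the constraint concrete, here's a minimal Go sketch of whole-block verification (the function name and git-blob framing are illustrative, not the actual bitswap code): the object ID is the hash of the complete contents, so no prefix of a 100MiB blob can be checked on its own.

```go
package main

import (
	"crypto/sha1"
	"fmt"
)

// verifyBlock returns true only if data hashes to the expected object
// ID. A 100MiB git blob must be fully buffered before this check can
// run; a malicious peer can feed us garbage right up until then.
func verifyBlock(expected [sha1.Size]byte, data []byte) bool {
	return sha1.Sum(data) == expected
}

func main() {
	data := []byte("blob 5\x00hello") // a git object: header + contents
	id := sha1.Sum(data)
	fmt.Println(verifyBlock(id, data)) // true, but only once all bytes arrived
}
```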
Possible Solutions
In case anyone wants to help solve this problem, here are a few possible solutions to get the discussion started:
Fail
Given that one shouldn’t be storing large objects in git, we could simply fail to check out such repositories. Unfortunately, GitHub has set its max object size to 100MiB so, if we want to support all GitHub repos, we’d have to support downloading 100MiB blocks.
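As a sketch, “failing” is just a size guard at fetch time, with GitHub’s limit as the ceiling (the constant and function names here are hypothetical, not from any real codebase):

```go
package main

import "fmt"

const maxGitHubObject = 100 << 20 // 100 MiB, GitHub's max object size

// checkBlockSize implements the "just fail" policy: refuse any block
// we would have to buffer wholesale before we could verify it.
func checkBlockSize(size int64) error {
	if size > maxGitHubObject {
		return fmt.Errorf("block is %d bytes; refusing blocks over %d bytes",
			size, int64(maxGitHubObject))
	}
	return nil
}

func main() {
	fmt.Println(checkBlockSize(150 << 20)) // error: too large
	fmt.Println(checkBlockSize(4096))      // <nil>: fine
}
```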
Trusted peers
An alternative is to only download large blocks from trusted peers (e.g., GitHub). This is, unfortunately, not very decentralized.
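A minimal sketch of what that gating might look like, assuming a configurable allowlist and a made-up 1MiB “large” threshold (the peer ID value and types are placeholders, not real libp2p types):

```go
package main

import "fmt"

// PeerID is a stand-in for a real libp2p peer ID.
type PeerID string

const largeBlockThreshold = 1 << 20 // assumed 1 MiB cutoff for "large"

// trustedPeers would be user-configured; this entry is a placeholder.
var trustedPeers = map[PeerID]bool{
	"12D3KooWExampleTrustedGateway": true,
}

// allowFetch permits small blocks from anyone, but restricts blocks we
// cannot verify incrementally to trusted peers only.
func allowFetch(p PeerID, size int64) bool {
	return size <= largeBlockThreshold || trustedPeers[p]
}

func main() {
	fmt.Println(allowFetch("somebody", 200<<20))                      // false
	fmt.Println(allowFetch("12D3KooWExampleTrustedGateway", 200<<20)) // true
}
```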
Take a vote
One possible (but not great) solution is to trust that N independently selected peers won’t collude. If we do, we could have peers split large blocks up into merkle trees of smaller blocks and then poll a randomly selected set of N peers for the hash of the merkle tree root that corresponds to the block we want. If our randomly selected peers agree, we would then download the entire merkle tree, reassemble the block, and verify that the reconstructed block matches its hash (see the sketch below).
While this would make fetching large blocks slow, large blocks should be rare enough that it wouldn’t matter much.
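Here’s a rough Go sketch of the voting flow; askRoot and fetchTree stand in for the real network calls, and the Hash type is a simplification of IPFS multihashes:

```go
package sketch

import (
	"crypto/sha1"
	"errors"
	"math/rand"
)

// Hash stands in for a content identifier.
type Hash = [sha1.Size]byte

// pollPeers asks n randomly selected peers which merkle root they
// claim corresponds to blockID, and errors out unless all n agree.
func pollPeers(peers []string, n int, blockID Hash,
	askRoot func(peer string, blockID Hash) (Hash, error)) (Hash, error) {

	var root Hash
	if len(peers) < n {
		return root, errors.New("not enough peers to poll")
	}
	for i, j := range rand.Perm(len(peers))[:n] {
		r, err := askRoot(peers[j], blockID)
		if err != nil {
			return root, err
		}
		if i == 0 {
			root = r
		} else if r != root {
			return root, errors.New("peers disagree on the merkle root")
		}
	}
	return root, nil
}

// fetchAndVerify downloads the tree under root in small, individually
// verifiable pieces (hidden inside fetchTree here), reassembles the
// block, and runs the final end-to-end check against the original hash.
func fetchAndVerify(root, blockID Hash,
	fetchTree func(Hash) ([]byte, error)) ([]byte, error) {

	data, err := fetchTree(root)
	if err != nil {
		return nil, err
	}
	if sha1.Sum(data) != blockID {
		return nil, errors.New("reassembled block does not match its hash")
	}
	return data, nil
}
```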
Extra metadata
We could also require that repositories with large objects store some extra metadata that allows us to validate those large objects piecewise. That is, we could take the metadata from the “take a vote” solution and check it into the repository itself. Then, when downloading the repository, we’d download the small blocks first, pull out this (small) metadata, and use it to validate the large blocks piecewise.
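A sketch of the piecewise check, assuming the chunk hashes have already been fetched from small, already-verified blocks (all names here are illustrative):

```go
package sketch

import (
	"crypto/sha1"
	"errors"
)

type Hash = [sha1.Size]byte

// verifyChunks validates a large object chunk by chunk against hash
// metadata checked into the repository and fetched beforehand via
// small, verified blocks.
func verifyChunks(chunks [][]byte, expected []Hash) error {
	if len(chunks) != len(expected) {
		return errors.New("chunk count does not match metadata")
	}
	for i, c := range chunks {
		if sha1.Sum(c) != expected[i] {
			return errors.New("bad chunk: drop the sending peer and refetch")
		}
	}
	return nil
}
```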
IMO, this would be more trouble than it’s worth.
Crypto Magic
Ideally, we’d be able to progressively validate SHA1 hashes of large objects as we download them by exploiting the fact that SHA1 is a streaming hash function (i.e., somehow verify it in reverse). Unfortunately, I’m pretty sure there’s no way to do this securely.
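For context, here’s what the streaming property looks like with Go’s standard library; it shows why the idea is tempting, and the closing comment marks the catch:

```go
package main

import (
	"crypto/sha1"
	"fmt"
)

func main() {
	// SHA1 is computed over a stream: the internal state after N bytes
	// depends only on those first N bytes.
	h := sha1.New()
	for _, chunk := range [][]byte{[]byte("large "), []byte("git "), []byte("object")} {
		h.Write(chunk) // state advances chunk by chunk
	}
	fmt.Printf("%x\n", h.Sum(nil))
	// The catch: nothing is checkable against the object ID until the
	// final state, so a peer can stream garbage up to the last byte.
}
```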