As I understand it it’s possible to add a file (to the IPFS network), get a hash, remove the file, and then add the file again after an arbitrary amount of time (10 days, 10 years, etc) and get the same exact hash, meaning the old/original hash URL will still work once you re-add the file to the network, which seems to mean that there is a unique hash for every possible file.
My question is, how can that work if there are astronomically more possible files then there are hashes (even on very old filesystems)?
ext4 has a max file size of 16 TiB, and if google calculator is correct that’s 1.407e+14 bits, so “2 to the power of 1.407e+14” possible files, while possible hashes are usually something like “2 to the power of 128” (340.282.366.920.938.463.463.374.607.431.768.211.456 possible hashes) or “2 to the power of 256” (115.792.089.237.316.195.423.570.985.008.687.907.853.269.984.665.640.564.039.457.584.007.913.129.639.936 possible hashes).
Even if you go down to a more responsible size like say 10 MB, that’s 80 million bits, or “2 to the power of 80 million” possible files, again: way more files then there are hashes.
So:
Does IPFS have a unique hash for every possible file (which my amateur math seems to indicate is not possible)?
Is my math simply wrong and we have more than enough hashes for every possible file?
Does IPFS perhaps do some trickery to have enough hashes for not all files but simply the number of files that realistically will ever be added to the network?
Something else?
A more simple and general question: is it possible to have a hash/id schema that could generate an id for every possible file?
With SHA256 it’s unlikely that someone will find an SHA256 collision – IPFS’ current hashing algorithm – any time soon. However, if someone does find a weakness in SHA256 then the network can move to a different hashing algorithm instead since IPFS uses multihashes.
To quote whyrusleeping from the linked thread:
Lets assume that the entire bitcoin mining economy decides to try and find an ipfs object hash collision, checking hashes at a rate of 400 Petahash (400,000,000,000,000,000 hashes per second) it would take them 2.810^59 seconds, or 910^51 years to compute the entire space.
As a side-note:
There are even more multihashes than just the one for the whole file. For larger files, they are broken into chunks behind the scenes and every chunk gets its own multihash as well.
That’s correct from the theoretical or mathematical point of view. But from the practical or physical point of view the number of different files in our little world is far less than the number of 256 bit hashes.
Another way to look at this is to generate a file and upload it to IPFS you need time and energy. That’s a small amount, but when multiplied by 2^256 it becomes mind blowing. If we could do a ipfs add instantly, with 0 time and energy, we could find a hash collision in 0 time
2 to the power of 128 … or 2 to the power of 256" possible hashes
That would give 2^256 = 1.1579209e+77 possible hashes, according to the same online calculator you were using (leaving scientific notation just to have an idea of the order of magnitude).
I think it is ok to calculate number of possibilities for each element ^ number of elements.
We have actually 45 elements, given that the first one of the 46 characters is always Q, right?
Each one of those 45 elements can assume any value inside [a-zA-Z0-9] to say it in Perl-ish, is that also correct?
Which means 26+26+10=62 possible values
62^45 = 4.5459644e+80 possible hashes which is about 3925.97 times 2^256
How many bytes would equal or exceed those possibilities?
1 byte 0-255 that is 256 possible values
ln(62^45)/ln(256) = 33.4923542459
I must be doing something wrong, 34 bytes would exceed the number of possible hashes I had found:
256^34 = 7.5885504e+81
Better mathematicians wanted.
Wait a minute, of course there’s no chance that a hash code based on 45 characters which value is in [a-zA-Z0-9], meaning 62 possible values for each character, even offers as many possibilities as a sequence of bytes of the same length, each of which has 256 possible values. But that’s the whole point about hash codes and the function which computes them. I guess we need to read about that