Merkle tree root hash with lthash?

I have a combined question and suggestion. I’ve been wondering why the Merkle tree root doesn’t contain a hash of the content. Its absence seems to be the source of some confusion, and I can think of some situations where it would be nice to have. One source of confusion is that the CID is a hash of the contents, but only indirectly via the Merkle tree, so without the entire tree there is no way to verify the hash of the complete file. The other is that the same file can hash to different values depending on parameters such as the chunker used. Unfortunately, that means I can add the same file twice and get two different hashes. That is what it is, but there isn’t any way to verify that they actually are the same file without retrieving both files, hashing the contents, and comparing the hashes.
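To make the chunker point concrete, here is a minimal sketch. It is not how IPFS builds its DAG: `merkle_root` is a hypothetical helper that builds a plain binary SHA-256 tree over fixed-size chunks, and the chunk sizes are arbitrary. It only shows that the same bytes give different roots under different chunking, while a flat hash of the content stays the same.

```python
import hashlib


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(data: bytes, chunk_size: int) -> bytes:
    """Hash fixed-size chunks, then hash pairs upward until one root is left.

    Hypothetical toy layout (binary tree, fixed-size chunker); the real
    IPFS/UnixFS layout is different.
    """
    level = [sha256(data[i:i + chunk_size])
             for i in range(0, len(data), chunk_size)] or [sha256(b"")]
    while len(level) > 1:
        level = [sha256(b"".join(level[i:i + 2]))
                 for i in range(0, len(level), 2)]
    return level[0]


# 16 KiB of sample data.
data = b"".join(i.to_bytes(4, "big") for i in range(4096))

root_small_chunks = merkle_root(data, 1024)
root_large_chunks = merkle_root(data, 4096)
flat_hash = sha256(data)

print("roots equal:", root_small_chunks == root_large_chunks)
print("flat hash:  ", flat_hash.hex())
```

Only the flat hash identifies the file independently of how it was chunked, which is the value the suggestion would store in the root node.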

I get why you wouldn’t want to store a full-content hash, because recomputing it over and over again would get expensive, but I could imagine that it might be possible using something like lthash: https://github.com/lukechampine/lthash

If you had that, you could determine whether two files were the same by just retrieving their root nodes and comparing lthashes. If gateways gave you a way to retrieve nodes as well as files, you could also retrieve a file’s root node, verify the CID, and compare the lthash, and then you wouldn’t need to trust the gateway.
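A rough sketch of the additive idea behind an lthash-style homomorphic hash, to show why updates could stay cheap. This is not the API of the lukechampine/lthash Go library: the names `element_vector`, `combine` and `lthash_like` are made up for illustration, SHAKE-256 stands in for whatever XOF a real implementation uses, and the lane count is a toy value. It also assumes both parties hash the same fixed-size logical blocks (tagged with their byte offset) regardless of which chunker built the DAG.

```python
import hashlib

LANES = 256        # toy number of 16-bit lanes (assumption)
MOD = 1 << 16
BLOCK = 4          # toy logical block size in bytes (assumption)


def element_vector(offset, block):
    """Expand (offset, block) into LANES 16-bit integers with an XOF."""
    stream = hashlib.shake_256(offset.to_bytes(8, "big") + block).digest(2 * LANES)
    return [int.from_bytes(stream[i:i + 2], "little")
            for i in range(0, 2 * LANES, 2)]


def combine(acc, vec, sign=1):
    """Add (sign=1) or remove (sign=-1) an element, lane-wise mod 2**16."""
    return [(a + sign * v) % MOD for a, v in zip(acc, vec)]


def lthash_like(data):
    """Sum the vectors of all offset-tagged blocks of the data."""
    acc = [0] * LANES
    for off in range(0, len(data), BLOCK):
        acc = combine(acc, element_vector(off, data[off:off + BLOCK]))
    return acc


old = b"aaaabbbbccccdddd"
new = b"aaaaXXXXccccdddd"   # only the block at offset 4 changed

digest = lthash_like(old)
# Incremental update: remove the old block, add the new one.
digest = combine(digest, element_vector(4, b"bbbb"), sign=-1)
digest = combine(digest, element_vector(4, b"XXXX"))

print("matches full recompute:", digest == lthash_like(new))
```

The update touches only the changed block, so the cost does not depend on the file size, and two root nodes carrying such a digest could be compared without fetching the file.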

Maybe I’m completely missing something, but I thought I’d throw it out there. If I am, maybe someone would be kind enough to let me know what.


I like the idea of including a hash of the original file in the root node of the DAG.

Currently if I query the root node by CID I get something like this:

$ ipfs dag get QmShzonja4XdJsWTwDtfA6so1LgEqxjBkffZg3LgcNEvf1 | jq
{
  "data": "CAIYgNDM0RAggIDgFSCAgOAVIICA4BUggIDgFSCAgOAVIICA4BUggIDgFSCAgOAVIICA4BUggIDgFSCAgOAVIICA4BUggIDgFSCAgOAVIICA4BUggIDgFSCAgOAVIICA4BUggIDgFSCAgOAVIICA4BUggIDgFSCAgOAVIICA4BUggIDgFSCAgOAVIICA4BUggIDgFSCAgOAVIICA4BUggIDgFSCAgOAVIICA4BUggIDgFSCAgOAVIICA4BUggIDgFSCAgOAVIICA4BUggIDgFSCAgOAVIICA4BUggIDgFSCAgOAVIICA4BUggIDgFSCAgOAVIICA4BUggIDgFSCAgOAVIICA4BUggIDgFSCAgOAVIICA4BUggIDgFSCAgOAVIICA4BUggIDgFSCAgOAVIICA4BUggIDgFSCAgOAVIICA4BUggIDgFSCAgOAVIICA4BUggIDgFSCAgOAVIICA4BUggIDgFSCAgOAVIICA4BUggIDgFSCAgOAVIICA4BUggIDgFSCAgOAVIICA4BUggIDgFSCAgOAVIICA4BUggIDgFSCAgOAVIICA4BUggIDgFSCAgOAVIICA4BUggIDgFSCAgOAVIICA4BUggIDgFSCAgOAVIICA4BUggIDgFSCAgOAVIICA4BUggIDgFSCA0OwT",
  "links": [
    {
....

I don’t see why it can’t include another field, such as original_hash.

The nature of a “Merkle tree” means that if you update a “node” in the tree, you only have to regenerate the hash of that one node and then propagate the change up through its parents until the root hash changes. During that process you never make a “full pass” over all the data, which is what regenerating a SHA256 of the whole file would require.
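The update cost described above can be sketched as follows. This is a toy binary SHA-256 tree, not the IPFS DAG format; the `MerkleTree` class is hypothetical and assumes the number of chunks is a power of two to keep the code short.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


class MerkleTree:
    """Toy binary Merkle tree; assumes a power-of-two number of chunks."""

    def __init__(self, chunks):
        # levels[0] holds the leaf hashes, levels[-1] holds the single root.
        self.levels = [[h(c) for c in chunks]]
        while len(self.levels[-1]) > 1:
            prev = self.levels[-1]
            self.levels.append([h(prev[i] + prev[i + 1])
                                for i in range(0, len(prev), 2)])

    @property
    def root(self) -> bytes:
        return self.levels[-1][0]

    def update(self, index: int, new_chunk: bytes) -> int:
        """Replace one leaf and rehash only its ancestors.

        Returns the number of hash computations performed.
        """
        self.levels[0][index] = h(new_chunk)
        hashes = 1
        for depth in range(1, len(self.levels)):
            index //= 2
            below = self.levels[depth - 1]
            self.levels[depth][index] = h(below[2 * index] + below[2 * index + 1])
            hashes += 1
        return hashes


chunks = [bytes([i]) * 1024 for i in range(8)]
tree = MerkleTree(chunks)

hashes_done = tree.update(5, b"changed chunk")
chunks[5] = b"changed chunk"

print("hashes recomputed:", hashes_done)
print("same root as full rebuild:", tree.root == MerkleTree(chunks).root)
```

With 8 chunks the update hashes one leaf plus three ancestors, whereas a flat SHA256 of the file would have to read all 8 chunks again.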

Although no one stated it explicitly, this is probably the “true” technical reason that IPFS is not using “true” content addressing, meaning a CID that is a plain hash of the file’s bytes.