How to calculate file/directory hash?

I'm interested in calculating IPFS hashes for various files and directories, somewhat similar to ipfs add -n. I've searched around, but wasn't able to find any solid documentation on how this is calculated exactly.

I've had a quick look through the code and tried figuring it out from experimentation, and have some basic idea (files are broken into 256K chunks, wrapped in some protobuf format, hashed using SHA-256, and there are index nodes which join these chunks together in a tree-like fashion, from what I can tell). However, is there some documentation/specification which explains all of this a bit better?
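
For example, this is roughly how I've been poking at the structure from the command line (just my own experimentation, so the file size and exact commands below are only for illustration):

# add a file larger than one chunk, then inspect the resulting tree
dd if=/dev/urandom of=big_file bs=1M count=4
ipfs add big_file                # prints the root hash of the tree, e.g. Qm...
ipfs object links <root-hash>    # lists the ~256K chunk nodes the root points to
ipfs object get <root-hash>      # dumps the protobuf node as JSON (Data + Links)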

Also, is there a way to add a file via stdin?

3 Likes

Could you clarify the question? 'ipfs add -n' does exactly what you're asking: it calculates the hashes of whatever you're adding. The '-n' flag tells ipfs to just calculate the hashes without actually adding the files.

Are you asking how to do this in code rather than from the command line?

is there some documentation/specification which explains all of this a bit better?

Try reading the tutorials at https://dweb-primer.ipfs.io

3 Likes

Yes, sorry for not making that clear. To put it another way, is there some specification/guide on how I would write a program to basically do the same thing as ipfs add -n?

Thank you for the response by the way.

1 Like

I don't know the specifics of how the current default behavior of ipfs add -n is implemented, but depending on your use case it might be worth noting that the hashes generated by go-ipfs's add command for a given file/directory depend on a variety of factors (whose defaults might change). For example, the output from ipfs add -n depends on the following (see the quick demonstration after the list):

  • chunking algorithm (--chunker option)
  • DAG format (--trickle option)
  • CID version (--cid-version option)
  • hashing algorithm (--hash option)
    • SHA-256 was/is the default but I think the default might be moving to BLAKE2b
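
For example (output hashes omitted, since they depend on your go-ipfs version; the file name and size are just for illustration):

# the same file can produce different identifiers depending on the options used
# (the -n flag means nothing is actually added)
dd if=/dev/urandom of=example_file bs=1M count=4
ipfs add -n example_file                      # defaults: sha2-256, CIDv0, balanced DAG, 256K chunks
ipfs add -n --hash=blake2b-256 example_file   # different hash function (also switches to CIDv1)
ipfs add -n --cid-version=1 example_file      # same data, different CID encoding
ipfs add -n --trickle example_file            # different DAG layout, different root hash
ipfs add -n --chunker=rabin example_file      # different chunk boundaries, different root hash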
2 Likes

Oh, so a file could have many possible hashes? And how hashes are calculated isn't yet standardised/stable? That was something I was unaware of, so thank you very much for the information!

I realised that the multi-hash system could allow the same file to have different hashes, but I was under the impression that there'd be a 'standard' hash (say SHA-256) that everyone would adhere to and that would rarely change.

But if hashes can wildly vary for the same thing, doesn't this somewhat reduce the effectiveness of content-based addressing?

1 Like

As I understand it, the IPFS hash is a self-describing hash, i.e. a node will know from the format which algorithm was used, at which length, etc. As long as you have the latest version installed, you should be fine. The latter was a problem for me when I tried accessing the Turkish Wikipedia mirror, which had been created using v0.4.9, with go-ipfs v0.4.8; that didn't work until I had v0.4.9 installed. But under normal circumstances, a node should recognize how a hash was calculated.

1 Like

Yeah... as I understand it, the first two letters ("Qm" is the most common in IPFS hashes) indicate which hash function and digest length were used: https://github.com/ipfs/faq/issues/22

So if you used a different hash function, the first two letters would be different, which should be recognized automatically...
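
For example (hashes omitted; -Q just prints the final hash, and the exact prefixes depend on your go-ipfs version):

# same file, two hash functions; the printed prefix reflects the multihash used
ipfs add -n -Q some_file                       # Qm... (sha2-256)
ipfs add -n -Q --hash=blake2b-256 some_file    # z...  (blake2b-256, CIDv1)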

1 Like

Thanks for the responses, but from what leerspace is saying, this isn't the only issue.

The multi-hash format only provides a wrapper around the hash algorithm, so you can change from SHA-256 to BLAKE2b, and the prefix will differ. HOWEVER, the hash is also affected by the chunking algorithm, DAG format and CID version, so you can get completely different hashes even when the prefix indicates the same hash function.

This means that the same file can be duplicated across the network (which goes against the claim of "zero duplication" on the IPFS home page), and this doesn't even seem to be a rare case. I mean, someone could add a file, remove it, upgrade the IPFS client (or use a different one), add it again, and get a completely different hash.

Or have I misunderstood something?

Yes... what you are describing could happen. However, if it does, most likely the user is forcing the duplication, as knowing how to change the hash algorithm, chunking algorithm, DAG format and CID version is more complicated and time-consuming than most users would have time for, without substantial incentive.

So assuming that IPFS doesn't change the default settings frequently, the probability that multiple hashes exist for the same file is low.

Further, I don't think each file will only exist once in the network (zero duplication). For the same hash, multiple peers could get the same file and pin it to their local drive, duplicating it. Duplication of files leads to redundancy, which I think is necessary and something users would want.

Maybe somebody could correct me if I am incorrect...

1 Like

Sidenote: I've installed the prebuilt multihash binary here, and it's giving me an IPFS hash different from running ipfs add -n /path/to/FileOrDir, both with default settings (i.e. sha2-256). I assume this has to do with the changes made to hashing in one of the earlier ipfs updates (I believe it was 0.4.8 to 0.4.9), and the standalone multihash hasn't been changed yet.

1 Like

Multihash just hashes the file and prints out its hash in multihash format.
ipfs add builds a merkledag and wraps the file in the needed format; that is why the hash is different.
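
A quick way to see the difference (assuming both the standalone multihash tool and go-ipfs are installed, and both are left at their sha2-256 default; the file path is just a placeholder):

# multihash hashes the raw file bytes
multihash /path/to/File     # multihash of the file contents themselves
# ipfs add hashes the root of the merkledag built from the chunked, protobuf-wrapped file
ipfs add -n /path/to/File   # root hash of the DAG, not a hash of the raw bytes
# both use sha2-256 by default, yet the two outputs differ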

1 Like

Thank you for your responses.

most users would have time for, without substantial incentive.

If there's little incentive to change those, what's the rationale for having them as switches? Are they perhaps mostly for experimental purposes, and no one is actually expected to use them?

So assuming that IPFS doesn't change the default settings frequently

This is a concern that I have. How often can we expect it to change? For one, it seems like the hash algorithm is already expected to change (SHA-256 to BLAKE2b). I get that IPFS is likely still in the experimental phase at the moment, but such a change could really impact a production environment.
And this is ignoring potential other implementations.

If the current defaults are not expected to change, shouldn't they be documented and/or written up as a specification? This would allow others to implement alternative applications which are compatible and don't end up partitioning the network. This documented hash should be exact, that is, contain no ambiguities/tunables such as a customizable chunking algorithm.
Unfortunately, flexibility and hashing don't really mix, since hashing has to be exact. That is, the same input must always give the same output hash (and not 10 different hashes depending on the settings used). I get that some flexibility may be desired, for example if a cryptographic hash has been broken, and it seems that the multi-hash format is intended to deal with this, but such changes should be performed rarely and considered somewhat breaking.

Is there such documentation available, or is it still being worked out, or something else?

Further, I don't think each file will only exist once in the network (zero duplication)

I interpret "zero duplication" as referring to not having wasteful duplicates (or zero duplication in named resources). That is, if you and I each have a copy of the same file, and person C wishes to obtain a copy, he can obtain it from either or both of us.
However, if the hash generated by you differs from mine, there is now a duplicate resource on the network. That is, person C can no longer download the file from both of us, and if he tries to download from both hashes (since there's no way to determine whether the underlying file is identical), he will have two copies of the same file on his disk.

The way I always understood it (and please correct me if I'm wrong) is that it doesn't matter what hash settings you use for a file that you add to IPFS. If you add a file to your node using the current default sha2-256, it's still the same file as when another user adds it to his node using the upcoming default blake2b-256. Doesn't the network treat both as the same object, independent of the hash settings? Which means, if you get/cat/pin an object by inputting a sha2-256 hash, the network will also grab parts of the object from the node that used the blake2b-256 hash?

If not, then, indeed, I second your reservations.

1 Like

The way I understand it from here: RFC: Future-proofed cryptographic hash values. · Issue #1 · jbenet/random-ideas · GitHub

is that the Qm prefix encodes the hashing algorithm (it is the base58 rendering of the multihash prefix bytes). However, the other factors that feed into the hash could still result in duplication (Ref: RFC: Future-proofed cryptographic hash values. · Issue #1 · jbenet/random-ideas · GitHub). Perhaps these parameters could be added to the code to reduce data duplication?

Also @jbenet seems to be aware of this issue of changing default settings too often, but perhaps a spec would help (Ref: RFC: Future-proofed cryptographic hash values. · Issue #1 · jbenet/random-ideas · GitHub)

1 Like

Maybe my test isn't valid for some reason, but trying to retrieve files added with one hash using a different hash doesn't work. I'm not sure how this could possibly work, but I wanted to double-check.

# Create random 1MiB file I'm pretty sure nobody else has
dd if=/dev/urandom of=rand_file bs=1M count=1

# add the file to ipfs with defaults
ipfs add rand_file  # gives me QmeHy1gq8QHVchad7ndEsdAnaBWGu1CAVmYCb4aTJW2Pwa

# generate a multihash for the same file using a different hashing algorithm, but don't make it available on ipfs
ipfs add -n --hash=blake2b-256 rand_file  # gives me zDMZof1kx7N1VLQa3bxVSk53tJLUSXzW4mUMyc57HqBqri94rava

# try to retrieve the file added to ipfs with sha2-256 using the blake2b-256 version of the hash
ipfs get zDMZof1kx7N1VLQa3bxVSk53tJLUSXzW4mUMyc57HqBqri94rava  # this command hangs (doesn't work)

# try to retrieve the file added to ipfs with sha2-256 using the sha2-256 version of the hash
ipfs get QmeHy1gq8QHVchad7ndEsdAnaBWGu1CAVmYCb4aTJW2Pwa  # this obviously works
2 Likes

So there will be no backward compatibility once blake2b-256 is introduced as the default, i.e. files might wind up on IPFS that are already there under a different hash. So @Nyan does have a point here.

PS: to reproduce on macOS use dd if=/dev/urandom of=rand_file bs=1m count=1
...with 1m instead of 1M.

1 Like

My understanding is that there will be backward compatibility in the sense that the old hashes will continue to work.

However, it will introduce duplicate data if people continue to re-add files to IPFS even though they already exist under older/different multihashes; I'm having trouble thinking of a common scenario for this in the theoretical future where content is addressed and distributed using its IPFS multihash. I don't know what OP's use case is, but I thought that one of the primary use cases for IPFS was where one person (maybe sometimes a few) adds the file into IPFS and many retrieve it from IPFS using its multihash (which will continue to work even if the defaults for adding new files change).

1 Like

If the world already used IPFS, this would seem like a stronger point, but with so many other distribution methods being used these days, a network which inherently allows such fragmentation doesn't really seem like a solution to much, I would think.

Having only one person add a file to IPFS and then distribute the hash seems to run counter to the notion of decentralization?

Is there much of a reason to allow such flexibility in the hash anyway?

1 Like

How so? This is similar to how BitTorrent works, and as I see it IPFS is at least as decentralized.

An example use case for this flexibility is the trickle DAG format (--trickle), which is generally better suited to video streaming than the default DAG format. I'd imagine there are other DAG formats suitable for other types of content as well. Depending on the content, different chunking algorithms may also be better at reducing duplicated blocks.
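
As a rough sketch of the chunker point (flag names from go-ipfs; how many blocks actually end up shared depends on the data):

# two files that differ only by one byte inserted at the very front
dd if=/dev/urandom of=big_file bs=1M count=4
{ printf 'x'; cat big_file; } > big_file_shifted

# fixed-size chunking: every 256K boundary shifts, so the two files share no leaf blocks
ipfs add big_file big_file_shifted
# content-defined (rabin) chunking: most boundaries stay put, so most blocks can be shared
ipfs add --chunker=rabin big_file big_file_shifted
# comparing `ipfs refs -r <root>` for the two roots in each case shows the overlap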

As for why use a multihash instead of hard-coding a specific cryptographic hash into the protocol, I think this GitHub issue addresses that somewhat: RFC: Future-proofed cryptographic hash values. · Issue #1 · jbenet/random-ideas · GitHub

As time passes, software that uses a particular hash function will often need to upgrade to a better, faster, stronger, ... one. This introduces large costs: systems may assume a particular hash size, or call sha1 all over the place.

1 Like

I was under the impression that IPFS was aiming to be more than just a BitTorrent alternative (correct me if I'm wrong); the home page seems to give the impression that decentralization is a key goal. However, if the intention was to have centralized "IPFS tracker" type websites, and hence basically be a copy of BitTorrent with a few niceties on top, then it seems perfectly acceptable.

Is there a good place which documents all of these? The help for ipfs add is fairly bare and doesn't describe exactly what a number of the options do.
I tried searching for information about trickle, and came across this. The last comment opines that there really isn't that much difference between the two. I don't know how accepted that opinion is, but my intuitive understanding of a tree-based hash is that it doesn't really impede streaming/sequential access in any way?

I get the rationale for customizable algorithms and see the benefits. My main concern is rather how readily the defaults are expected to change. For a stable network, they should ideally change only when absolutely necessary.

Thank you for the response!

2 Likes