How to calculate file/directory hash?

I only mentioned BitTorrent as an example of a decentralized system that follows the upload-once, download-many-times workflow. Unless we're saying BitTorrent isn't decentralized, I was trying to understand how "having only one person add a file to IPFS and then distribute the hash seems counter-intuitive to the notion of decentralization". Anybody in the swarm who has or requests an IPFS hash can become a provider of the content it addresses.

I'm not aware of a single place that documents all of these thoroughly, but one would be helpful. I usually use this command reference to try to find existing commands that might do what I'm looking for, but in terms of completeness I think it's the same as the help text output.

I'm not sure what that comment author's experience includes, but based on this discussion around the implementation of the trickle-dag format there are some tangible differences: https://github.com/ipfs/go-ipfs/pull/713

Side note: I was trying to find a conversation I'd stumbled across before, but here's a relevant discussion related to the hash-flexibility question in this thread: Hash conversion for import/export and long term archival · Issue #1953 · ipfs/kubo · GitHub

With BitTorrent, the system relies on one point of centralization: the indexer. Once the index goes down, no one can find the content. Of course, people who have already downloaded the content can recreate torrent files and share them with those around them. However, BitTorrent doesn't have a strictly defined hashing scheme, so the regenerated torrents may differ, and the resulting small swarms may never be able to discover each other.

IPFS is probably better in this regard: as long as the hash settings aren't changed, the hashes should usually come out the same. But it doesn't seem to solve the underlying problem.

Unfortunately, it gives very little information on what trickle does. I haven't read through your other link yet, but thank you very much for the information.

Thank you for the link. It seems like the aim of the project is different to what I had imagined. I suppose the following quote expresses my thinking at the moment:

Thanks again @leerspace for all the info!

Newer clients like Tribler have decentralized indexes: Truly Decentralized BitTorrent Downloading Has Finally Arrived - TorrentFreak

also some great discussions about this issue here: if the Permanent Web is "content-addressable", could it be designed so that each file has only one address? · Issue #126 · ipfs/notes · GitHub

and avoid duplicating files added to ipfs · Issue #875 · ipfs/kubo · GitHub

In theory, the same file can have multiple hashes. In practice, almost everyone uses the default settings.

Are they any good? How do they solve the spam problem?

Why can't this be agreed on by some consensus, instead of letting every client do as it pleases? Using anything other than the standard function should at least be strongly discouraged.

Currently, by default, the hash of a block is the multihash of the sha256 of the block's data. For example: split -b 262144 file.bin; sha256sum x*. The hash of a file is the hash of the canonicalized IPLD object containing the chunks:
https://github.com/ipld/specs/tree/master/ipld
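
To make that concrete, here's a rough Python equivalent of the split + sha256sum example: cut the file into the default 256 KiB chunks and hash each one with sha-256. Note this only gives the raw chunk digests; the identifiers IPFS actually uses are multihash-wrapped, and for files the chunks are referenced from a protobuf-encoded IPLD node, as discussed further down.

```python
import hashlib

CHUNK_SIZE = 256 * 1024  # go-ipfs default chunker size (256 KiB)

def chunk_digests(path):
    """Yield the sha-256 hex digest of each fixed-size chunk of a file."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            yield hashlib.sha256(chunk).hexdigest()

for i, digest in enumerate(chunk_digests("file.bin")):
    print(i, digest)
```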

Just hash them again. This is a relatively small project; everyone is using go-ipfs right now. If you want to switch over to blake2b, just add code in the next update that re-hashes all sha256 chunks in the datastore as blake2b the next time the daemon is restarted. The old sha256 chunks are then changed to point to the blake2b chunks until they get garbage collected. You could use both hashes at the same time during a transition period, during which anything in the datastore is made available under both hashes, but anything newly added will by default be output by the GUI as blake2b. Then, when nobody is distributing sha256 hashes anymore, you just remove the dual hashing in the background.
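
As a toy illustration of what I mean (the key formats and structure here are invented for the example and look nothing like the real go-ipfs datastore code): re-index every block under its blake2b key, and keep an alias from the old sha256 key so both keep resolving during the transition.

```python
import hashlib

def sha256_key(block: bytes) -> str:
    return "sha2-256:" + hashlib.sha256(block).hexdigest()

def blake2b_key(block: bytes) -> str:
    # 256-bit BLAKE2b digest, comparable in size to sha2-256
    return "blake2b-256:" + hashlib.blake2b(block, digest_size=32).hexdigest()

def migrate(datastore: dict) -> dict:
    """Re-index every block under its BLAKE2b key and return aliases from
    old sha2-256 keys to the new keys, so old references keep resolving
    until nobody requests them anymore."""
    aliases = {}
    for old_key, block in list(datastore.items()):
        new_key = blake2b_key(block)
        datastore[new_key] = block
        aliases[old_key] = new_key
    return aliases

# toy datastore keyed by sha2-256
store = {sha256_key(b"hello"): b"hello"}
print(migrate(store))
```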

…which, if they change, can result in multiple hashes…

There was a bunch of security issues reported about Tribler: https://lists.torproject.org/pipermail/tor-dev/2014-December/007999.html I have no clue whether they're fundamental issues though, or whether they've been addressed.

From what I've seen, it appears to be more complex than your command seems to indicate, including a protobuf wrapper, and hashes organised in tree form?

The downside here is that it requires the file to be in the file store of the person making the change, and requires an expensive rehash operation. I can't see many people willingly doing this, unfortunately.


I was originally considering the idea of hashing a file but not 'seeding' it, while still distributing the hash. The idea being that if someone else decided to actually add the file to IPFS, the hash would become valid and the content discoverable by everyone. Unfortunately, it appears that stability of the hashing system isn't really a goal, so I guess IPFS doesn't really suit my use case.
Thanks again for all the info given.

Yes, and I agree it's a terrible "feature", but in practice it's rare for people to mess around with the settings.

I mean the indexing. There have been similar attempts before, and they've mostly been rubbish.

No, the hash of a block is just the sha256 of it. The hash of a file is a different animal, see above.
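
(To be pedantic, the string you see for a block isn't the bare sha256 hex but a base58-encoded multihash: one byte naming the hash function (0x12 for sha2-256), one byte for the digest length (0x20), then the digest. A rough Python sketch of that wrapping, assuming the old CIDv0-style string with no multibase prefix; note that for file blocks the bytes being hashed are the serialized node, protobuf wrapper included, which is what the protobuf question above was getting at.)

```python
import hashlib

B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def base58btc(data: bytes) -> str:
    """Plain base58btc encoding (no checksum), as used for CIDv0 strings."""
    n, out = int.from_bytes(data, "big"), ""
    while n:
        n, rem = divmod(n, 58)
        out = B58_ALPHABET[rem] + out
    # preserve leading zero bytes as '1's
    return "1" * (len(data) - len(data.lstrip(b"\x00"))) + out

def sha256_multihash(block: bytes) -> str:
    """0x12 = sha2-256, 0x20 = 32-byte digest length, then the digest."""
    return base58btc(bytes([0x12, 0x20]) + hashlib.sha256(block).digest())

print(sha256_multihash(b"some block bytes"))
```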

The default size of the file store is 10 GB. You only need to rehash the files once each time you change the protocol. A rehash operation is I/O-bottlenecked: an HDD reads at about 150 Mbit/s (roughly 19 MB/s), so it would take about 9 minutes to rehash a full default-size file store. This only needs to be done once every few years; Wikipedia suggests your average hash function has a shelf life of around 10 years. Since files are read sooner or later anyway, you could just re-hash them whenever you get around to reading them. If they're not accessed, the garbage collector will get them, so whatever you do, the old hashes will slowly be purged from the network. You don't need the entire file; it can be done separately for each block.
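
(The 9-minute figure is just the default store size divided by that read rate:)

```python
store_bytes = 10e9            # default datastore size: 10 GB
read_rate = 150e6 / 8         # 150 Mbit/s expressed in bytes per second
print(store_bytes / read_rate / 60, "minutes")  # roughly 9 minutes
```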

It's still in development. But you could just stipulate that they use the default hash options; you need to go out of your way to change them and it's rare for people to do so.

I agree with you, it's not a good idea to allow people to change the hash options willy-nilly just because of "freedom". But it's no big problem in practice.

I think you misunderstand. I meant that the application's default settings can change. For example, BLAKE2b is being considered as a new default instead of SHA256, without any manual intervention from the user (other than updating the software). It's unknown whether the other options will get new defaults in the future, but I haven't seen any assurance that they're expected to change only rarely.

Oh I see. But yeah, the hash of the file is what I'm after.

Try replacing 10 GB with 10 TB, unless you're advocating that no one would change such a setting? (Though I'd argue that even 10 TB is small if you're looking a few years into the future.)

It does somewhat depend on the hash and CPU, as well as whether a low-power disk is being used. For large amounts of storage, it doesn't seem unlikely for a low-power CPU (say, ARM) and a low-power, large-capacity disk to be used. I get around 82 MB/s for SHA256 on a 2.4 GHz Silvermont Atom CPU here, which is definitely slower than the disk throughput rate. BLAKE2b is likely faster, but I don't have anything to test it on at the moment.

The concern I have is that SHA256 isn't really out of its 'shelf life', that is, it's not known to be weak, and yet it's being replaced.

I suspect this may not be the best solution for data you want archived.

As well as a particular client and client version?

They're not going to be changed frivolously though; that goes against common sense. So you can hash them with both SHA256 and BLAKE2b for long-term archival. They're not going to remove support for SHA256. You can specify any supported hash function as the parameter for --hash.

Check the IPLD spec under Canonical Format, A Chunked File, and A Directory.

I'm discussing a mechanism for transitioning between hash types, performed at network scale. Most people wouldn't have 10 TB file stores. If someone does have a 10 TB file store, the rehash can be done whenever there are spare CPU cycles and disk I/O, or whenever a file is read for other reasons. A 10 TB file store would be unlikely to fill up, though, since you only download content that you want to download, and inactive content that you haven't pinned gets automatically deleted.

A regular desktop HDD can read at 170 MB/s at most. Your 82 MB/s figure - is that for all cores or single-threaded? The average node's CPU is faster at hashing than its HDD is at reading, and I think there are very few nodes that are slower at hashing. For nodes that are faster at reading, this optimization won't give them anything, but they already have to do the hashing once when downloading to check data integrity. It can be done when the CPU is at low utilization, or spread over a longer period of time.
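
If you want to check this on your own hardware, here's a quick and dirty single-threaded throughput test. It uses Python's hashlib, so the absolute numbers will be somewhat lower than what go-ipfs would get; it's the relative comparison that's interesting.

```python
import hashlib, os, time

def hash_throughput(algo: str, total=512 * 1024 * 1024, block=1 << 20) -> float:
    """Hash `total` bytes of random data in `block`-sized updates; return MB/s."""
    h = hashlib.new(algo)
    buf = os.urandom(block)
    start = time.perf_counter()
    for _ in range(total // block):
        h.update(buf)
    elapsed = time.perf_counter() - start
    return total / elapsed / 1e6

print("sha256 :", round(hash_throughput("sha256")), "MB/s")
print("blake2b:", round(hash_throughput("blake2b")), "MB/s")
```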

If you want to archive data, you pin it; pinned data is never garbage collected. If you want to archive data, you probably use the filestore feature to link to existing files on your HDD that aren't in the repo. Sooner or later someone will access it, and then it can have its hash recalculated.

Fair point. "Use the default settings in 0.4.9 (256kb chunks and sha-256)"

May I ask, what are you archiving? You can add content to IPFS and seed it without copying it to the internal IPFS data store.

Even if the default hash algorithm is changed, you're still going to be able to download files created with the old one, but creation will be discouraged. I can even add files using sha-1 right now. There's no reason to do so, but I can, and people can download the hashes without any special settings in their client. So that's what might happen to SHA256 in a few years.

How long is "long term" though? I'm concerned because changing SHA256 to BLAKE2b does seem like a fairly frivolous change - SHA256 isn't broken or considered weak by any standard. The only reason I could gather was mostly a speed increase - nice, but not something I'd like to see in a stable network.

Thank you very much for the info, though it seems sparse. For example, I can't seem to find much about how file chunks are wrapped in protobufs?

I'm hoping that the design is intended for more than just 'most people', unless the original intent was never really to support such usage. Note that I'm looking at this from an uploader's perspective, not a downloader's, so presumably everything will be pinned.
If 10 TB is unusual today, it won't be in 10 years' time. Also, spare CPU or I/O time isn't always free, particularly in a cloud environment where you pay for what you use, or in more complex topologies involving hybrid semi-cold storage (where access can actually be relatively expensive).
To put it another way, I don't think rehashing is an acceptable solution unless there is absolutely no other choice.

A single core of an Atom C2750, and the C2750 is quite a powerful processor. Consider one of these: a 1 GHz dual-core Cortex-A9 with 2-6 TB of storage, which seems like a very good candidate for IPFS 'seeding'. I'd be surprised if SHA256 ran faster than 50 MB/s total across both cores.
Perhaps you're thinking of users running high-performance 60+ W desktop processors? If that's the case, then yes, CPU is likely not an issue, but on a low-power 2 W CPU, which I feel will become more prevalent in the future, it can be quite different.

Unfortunately, the fragmentation problem I refer to is seen frequently with torrents. A file is shared, then re-shared again and again. Often the original .torrent files are not kept around, so for less frequently accessed content, someone may end up re-creating a .torrent file from the actual content so it can be re-shared.
Torrents are identified by the info hash, but unfortunately this can vary even if the underlying content is identical, for example due to differences in selected piece sizes or the ordering of files. This causes fragmentation in the network, as users attempting to access the content via the older .torrent file may not see peers accessing it via the newly shared .torrent (or torrents, if multiple versions are distributed). If BitTorrent had defined a strict hashing mechanism - say, a fixed chunk size (possibly in a tree fashion) and an exact ordering of files - this problem wouldn't exist, as the same content would always yield the same hash and the network would stay efficient.

I was hoping that IPFS would address this issue, and the way the home page is presented led me to believe it did, but thanks to the explanations here, this clearly does not seem to be the aim.

Actually, everyone uses the default, SHA256.

I'm not a developer, but I'm guessing the intent was something along the lines of "let's change it while the network is still young".

Look under "A Chunked File". You can also use ipfs object get --enc protobuf /ipfs/Qm... to look at real life examples, both of directories and files.

If you're rehashing large amounts of content, it's enough that most people rehash their stores; I'm talking about it from a network perspective. From an uploader's perspective, IPFS will continue to support legacy hashes - you can even use SHA-1 right now, even though it was never used in IPFS.
A datastore of 10 TB is a lot. It would either require a very fast internet connection, or a lot of the content would be garbage collected, shrinking it. If you want to archive files, you shouldn't add and pin them; you should use the file store. That way you don't need to store duplicate copies.

Indeed, or laptop processors. As long as they make up a large part of the network, it would be a feasible way to migrate all hashes without manual intervention.

IPFS does, in practice. SHA256 + 256kb blocks is the standard, and it's unlikely to change. If it does, it will be due to a network-level change. Users changing their hashing settings on a large scale seems implausible.

This isn't a big problem. Changing file order or filenames will change the hash of the folder they're inside, but it won't change the file's hash. Appending a byte to the end of a file will change the file's hash, but it will only change the last chunk's hash. Files and folders are just metadata; IPFS operates on chunks.
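
A toy sketch of that chunk-level behaviour with fixed-size chunking (plain Python, not IPFS itself): append one byte to a file that isn't chunk-aligned and only the final chunk digest changes, so everything before it deduplicates.

```python
import hashlib

CHUNK_SIZE = 256 * 1024  # default chunk size

def chunk_hashes(data: bytes) -> list:
    """sha-256 digest of every fixed-size chunk of the data."""
    return [hashlib.sha256(data[i:i + CHUNK_SIZE]).hexdigest()
            for i in range(0, len(data), CHUNK_SIZE)]

original = bytes(2 * CHUNK_SIZE + 1000)   # toy file: two full chunks plus a partial one
appended = original + b"!"                # append a single byte

a, b = chunk_hashes(original), chunk_hashes(appended)
unchanged = sum(x == y for x, y in zip(a, b))
print(f"{unchanged} of {len(a)} chunks unchanged, {len(b) - unchanged} changed")
# prints: 2 of 3 chunks unchanged, 1 changed
```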

It is an aim, and IPFS does it much better than BitTorrent. It's just not all the way there yet.

That makes a lot of sense. I'm just wondering if there is a particular point where the network isn't considered "young" and stability is a higher goal?

Thanks for the pointers, but there doesn't seem to be much in the way of documentation, unfortunately.

This is the idea. Unfortunately, it doesn't remove the need to rehash if hashes change.

I was actually referring to torrents there, not IPFS. Apologies if I wasn't clear about that.
IPFS seems to solve the issues that torrents have, but it introduces issues of its own in relation to unstable hashes. I was just demonstrating a practical example of the problems with flexible hashing schemes.

After the switch is done, or when IPFS gets a large userbase.

The IPLD documentation shows the exact structure and encoding used. You could always generate the chunk hashes, hex-encode them, and then look at the IPFS protobuf to see where they end up. Or look at the relevant code in go-ipfs.

You'll still be able to use the old hash format for downloading; it's just not going to be recommended when you add a new file.

Alright, thank you very much for sticking with me and providing all that info es_00788224! :slight_smile:

@Nyan I'm wondering if you ever found a solution to compute an IPFS hash for a file and have it match what is output from ipfs add?

Unfortunately not, but maybe there's something better out there now - I haven't checked, however.

Okay, thanks for the response. It's unfortunate that this process isn't documented, as validating IPFS hashes without having to upload a file could be useful for many applications. I was able to find logic in this library which I believe handles the composition of the CIDs (https://github.com/ipfs/js-ipfs-unixfs-engine/blob/master/src/builder/builder.js), but I still need to dig in a little more.
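
In case it helps anyone else, here's a rough Python sketch of what I think the simplest case boils down to, based on my reading of the unixfs and dag-pb protobuf definitions: a file that fits in a single chunk, added with the old defaults (sha-256, 256 KiB chunker, CIDv0, no raw leaves). The field numbers and ordering are my own reconstruction, so verify the output against ipfs add before relying on it; multi-chunk files and directories add Links and blocksizes, which is where the builder.js logic above comes in.

```python
import hashlib

B58 = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def varint(n: int) -> bytes:
    # unsigned varint as used by protobuf for tags, lengths and integers
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        out.append(byte | (0x80 if n else 0))
        if not n:
            return bytes(out)

def base58btc(data: bytes) -> str:
    n, s = int.from_bytes(data, "big"), ""
    while n:
        n, r = divmod(n, 58)
        s = B58[r] + s
    return "1" * (len(data) - len(data.lstrip(b"\x00"))) + s

def cid_v0_single_chunk(content: bytes) -> str:
    # unixfs Data message: Type=File(2), Data=<content>, filesize=len(content)
    unixfs = (b"\x08\x02"
              + b"\x12" + varint(len(content)) + content
              + b"\x18" + varint(len(content)))
    # dag-pb PBNode with no Links, just the Data field (field 1, length-delimited)
    pbnode = b"\x0a" + varint(len(unixfs)) + unixfs
    # CIDv0 string = base58btc(0x12 0x20 || sha2-256 digest of the node bytes)
    return base58btc(b"\x12\x20" + hashlib.sha256(pbnode).digest())

print(cid_v0_single_chunk(b"hello world\n"))
```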

@alexander Were you able to find a solution?

@Wilhelm have a look at this thread. I was given some advice in the comments that might be helpful to you too: https://github.com/ipfs/js-ipfs/issues/1205
