How to calculate file/directory hash?

I only mentioned BitTorrent as an example of a decentralized system that follows the upload-once, download-many-times workflow. Unless we're saying BitTorrent isn't decentralized, I was trying to understand how "having only one person add a file to IPFS and then distribute the hash seems counter-intuitive to the notion of decentralization". Anybody in the swarm who has or requests an IPFS hash can become a provider of the content it addresses.

I'm not aware of a single place that documents all of these thoroughly, but one would be helpful. I usually use this command reference to try to find existing commands that might do what I'm looking for, but in terms of completeness I think it's the same as the help text output.

I'm not sure what that comment author's experience includes, but based on this discussion around the implementation of the trickle-dag format there are some tangible differences: https://github.com/ipfs/go-ipfs/pull/713

Side note: I was trying to find a conversation I'd stumbled across before, but here's a relevant discussion related to the hash-flexibility question in this thread: Hash conversion for import/export and long term archival · Issue #1953 · ipfs/kubo · GitHub

With BitTorrent, the system relies on one point of centralization: the indexer. Once the index goes down, no one can find the content. Of course, people who have already downloaded the content can recreate torrent files and share them with those around them. However, BitTorrent doesn't have a strictly defined hashing scheme, so the regenerated torrents may differ, and the resulting small swarms may never be able to discover each other.

IPFS is probably better in this regard: as long as the hash settings aren't changed, the hashes should usually come out the same. But it doesn't seem to solve the underlying problem.

Unfortunately, it gives very little information on what trickle does. I haven't read through your other link yet, but thank you very much for the information.

Thank you for the link. It seems like the aim of the project is different to what I had imagined. I suppose the following quote expresses my thinking at the moment:

Thanks again @leerspace for all the info!

Newer clients like Tribler have decentralized indexes: Truly Decentralized BitTorrent Downloading Has Finally Arrived - TorrentFreak

also some great discussions about this issue here: if the Permanent Web is "content-addressable", could it be designed so that each file has only one address? · Issue #126 · ipfs/notes · GitHub

and avoid duplicating files added to ipfs · Issue #875 · ipfs/kubo · GitHub

In theory, the same file can have multiple hashes. In practice, almost everyone uses the default settings.

Are they any good? How do they solve the spam problem?

Why can't this be agreed on by some consensus, instead of letting every client do as it pleases? Using anything other than the standard function should at least be strongly discouraged.

Currently, by default, the hash of a block is the multihash of the sha256 of the block's data. For example: split -b 262144 file.bin; sha256sum x*. The hash of a file is the hash of the canonicalized IPLD object containing the chunks:
https://github.com/ipld/specs/tree/master/ipld
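
To make that concrete, here's a rough Python equivalent of the split + sha256sum example: cut the file into the default 256 KiB chunks and hash each one with sha-256. Note this only gives the raw chunk digests; the identifiers IPFS actually uses are multihash-wrapped, and for files the chunks are referenced from a protobuf-encoded IPLD node, as discussed further down.

```python
import hashlib

CHUNK_SIZE = 256 * 1024  # go-ipfs default chunker size (256 KiB)

def chunk_digests(path):
    """Yield the sha-256 hex digest of each fixed-size chunk of a file."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            yield hashlib.sha256(chunk).hexdigest()

for i, digest in enumerate(chunk_digests("file.bin")):
    print(i, digest)
```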

Just hash them again. This is a relatively small project; everyone is using go-ipfs right now. If you want to switch over to blake2b, just add code in the next update that re-hashes all sha256 chunks in the datastore as blake2b the next time the daemon is restarted. The old sha256 chunks are then changed to point to the blake2b chunks until they get garbage collected. You could use both hashes at the same time during a transition period, during which anything in the datastore is made available under both hashes, but anything newly added will by default be output by the GUI as blake2b. Then, when nobody is distributing sha256 hashes anymore, you just remove the dual hashing in the background.
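
As a toy illustration of what I mean (the key formats and structure here are invented for the example and look nothing like the real go-ipfs datastore code): re-index every block under its blake2b key, and keep an alias from the old sha256 key so both keep resolving during the transition.

```python
import hashlib

def sha256_key(block: bytes) -> str:
    return "sha2-256:" + hashlib.sha256(block).hexdigest()

def blake2b_key(block: bytes) -> str:
    # 256-bit BLAKE2b digest, comparable in size to sha2-256
    return "blake2b-256:" + hashlib.blake2b(block, digest_size=32).hexdigest()

def migrate(datastore: dict) -> dict:
    """Re-index every block under its BLAKE2b key and return aliases from
    old sha2-256 keys to the new keys, so old references keep resolving
    until nobody requests them anymore."""
    aliases = {}
    for old_key, block in list(datastore.items()):
        new_key = blake2b_key(block)
        datastore[new_key] = block
        aliases[old_key] = new_key
    return aliases

# toy datastore keyed by sha2-256
store = {sha256_key(b"hello"): b"hello"}
print(migrate(store))
```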

…which, if they change, can result in multiple hashes…

There was a bunch of security issues reported about Tribler: https://lists.torproject.org/pipermail/tor-dev/2014-December/007999.html I have no clue whether they're fundamental issues though, or whether they've been addressed.

From what I've seen, it appears to be more complex than your command seems to indicate, including a protobuf wrapper, and hashes organised in tree form?

The downside here is that it requires the file to be in the file store of the person making the change, and requires an expensive rehash operation. I can't see many people willingly doing this, unfortunately.


I was originally considering the idea of hashing a file but not 'seeding' it, while still distributing the hash. The idea being that if someone else decided to actually add the file to IPFS, the hash would become valid and the content discoverable by everyone. Unfortunately, it appears that stability of the hashing system isn't really a goal, so I guess IPFS doesn't really suit my use case.
Thanks again for all the info given.

Yes, and I agree it's a terrible "feature", but in practice it's rare for people to mess around with the settings.

I mean the indexing. There have been similar attempts before, and they've mostly been rubbish.

No, the hash of a block is just the sha256 of it. The hash of a file is a different animal, see above.
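
(To be pedantic, the string you see for a block isn't the bare sha256 hex but a base58-encoded multihash: one byte naming the hash function (0x12 for sha2-256), one byte for the digest length (0x20), then the digest. A rough Python sketch of that wrapping, assuming the old CIDv0-style string with no multibase prefix; note that for file blocks the bytes being hashed are the serialized node, protobuf wrapper included, which is what the protobuf question above was getting at.)

```python
import hashlib

B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def base58btc(data: bytes) -> str:
    """Plain base58btc encoding (no checksum), as used for CIDv0 strings."""
    n, out = int.from_bytes(data, "big"), ""
    while n:
        n, rem = divmod(n, 58)
        out = B58_ALPHABET[rem] + out
    # preserve leading zero bytes as '1's
    return "1" * (len(data) - len(data.lstrip(b"\x00"))) + out

def sha256_multihash(block: bytes) -> str:
    """0x12 = sha2-256, 0x20 = 32-byte digest length, then the digest."""
    return base58btc(bytes([0x12, 0x20]) + hashlib.sha256(block).digest())

print(sha256_multihash(b"some block bytes"))
```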

The default size of the file store is 10 GB. You only need to rehash the files once each time you change the protocol. A rehash operation is I/O-bottlenecked: an HDD reads at about 150 Mbit/s (roughly 19 MB/s), so it would take about 9 minutes to rehash a full default-size file store. This only needs to be done once every few years; Wikipedia suggests your average hash function has a shelf life of around 10 years. Since files are read sooner or later anyway, you could just re-hash them whenever you get around to reading them. If they're not accessed, the garbage collector will get them, so whatever you do, the old hashes will slowly be purged from the network. You don't need the entire file; it can be done separately for each block.
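
(The 9-minute figure is just the default store size divided by that read rate:)

```python
store_bytes = 10e9            # default datastore size: 10 GB
read_rate = 150e6 / 8         # 150 Mbit/s expressed in bytes per second
print(store_bytes / read_rate / 60, "minutes")  # roughly 9 minutes
```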

It's still in development. But you could just stipulate that they use the default hash options; you need to go out of your way to change them and it's rare for people to do so.

I agree with you, it's not a good idea to allow people to change the hash options willy-nilly just because of "freedom". But it's no big problem in practice.

I think you misunderstand. I meant that the application's default settings can change. For example, BLAKE2b is being considered as a new default instead of SHA256, without any manual intervention from the user (other than updating the software). It's unknown whether the other options will get new defaults in the future, but I haven't seen any assurance that they're expected to change only rarely.

Oh I see. But yeah, the hash of the file is what I'm after.

Try replacing 10 GB with 10 TB, unless you're advocating that no one would change such a setting? (Though I'd argue that even 10 TB is small if you're looking a few years into the future.)

It does somewhat depend on the hash and CPU, as well as whether a low-power disk is being used. For large amounts of storage, it doesn't seem unlikely for a low-power CPU (say, ARM) and a low-power, large-capacity disk to be used. I get around 82 MB/s for SHA256 on a 2.4 GHz Silvermont Atom CPU here, which is definitely slower than the disk throughput rate. BLAKE2b is likely faster, but I don't have anything to test it on at the moment.

The concern I have is that SHA256 isn't really out of its 'shelf life', that is, it's not known to be weak, and yet it's being replaced.

I suspect this may not be the best solution for data you want archived.

As well as a particular client and client version?

They're not going to be changed frivolously though; that goes against common sense. So you can hash them with both SHA256 and BLAKE2b for long-term archival. They're not going to remove support for SHA256. You can specify any supported hash function as the parameter for --hash.

Check the IPLD spec under Canonical Format, A Chunked File, and A Directory.

I'm discussing a mechanism for transitioning between hash types, performed at network scale. Most people wouldn't have 10 TB file stores. If someone does have a 10 TB file store, the rehash can be done whenever there are spare CPU cycles and disk I/O, or whenever a file is read for other reasons. A 10 TB file store would be unlikely to fill up, though, since you only download content that you want to download, and inactive content that you haven't pinned gets automatically deleted.

A regular desktop HDD can read at 170 MB/s at most. Your 82 MB/s figure - is that for all cores or single-threaded? The average node's CPU is faster at hashing than its HDD is at reading, and I think there are very few nodes that are slower at hashing. For nodes that are faster at reading, this optimization won't give them anything, but they already have to do the hashing once when downloading to check data integrity. It can be done when the CPU is at low utilization, or spread over a longer period of time.
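
If you want to check this on your own hardware, here's a quick and dirty single-threaded throughput test. It uses Python's hashlib, so the absolute numbers will be somewhat lower than what go-ipfs would get; it's the relative comparison that's interesting.

```python
import hashlib, os, time

def hash_throughput(algo: str, total=512 * 1024 * 1024, block=1 << 20) -> float:
    """Hash `total` bytes of random data in `block`-sized updates; return MB/s."""
    h = hashlib.new(algo)
    buf = os.urandom(block)
    start = time.perf_counter()
    for _ in range(total // block):
        h.update(buf)
    elapsed = time.perf_counter() - start
    return total / elapsed / 1e6

print("sha256 :", round(hash_throughput("sha256")), "MB/s")
print("blake2b:", round(hash_throughput("blake2b")), "MB/s")
```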

If you want to archive data, you pin it; pinned data is never garbage collected. If you want to archive data, you probably use the filestore feature to link to existing files on your HDD that aren't in the repo. Sooner or later someone will access it, and then it can have its hash recalculated.

Fair point. "Use the default settings in 0.4.9 (256kb chunks and sha-256)"

May I ask, what are you archiving? You can add content to IPFS and seed it without copying it to the internal IPFS data store.

Even if the default hash algorithm is changed, you're still going to be able to download files created with the old one, but creation will be discouraged. I can even add files using sha-1 right now. There's no reason to do so, but I can, and people can download the hashes without any special settings in their client. So that's what might happen to SHA256 in a few years.

How long is "long term" though? I'm concerned because changing SHA256 to BLAKE2b does seem like a fairly frivolous change - SHA256 isn't broken or considered weak by any standard. The only reason I could gather was mostly a speed increase - nice, but not something I'd like to see in a stable network.

Thank you very much for the info, though it seems sparse. For example, I can't seem to find much about how file chunks are wrapped in protobufs?

I'm hoping that the design is intended for more than just 'most people', unless the original intent was never really to support such usage. Note that I'm looking at this from an uploader's perspective, not a downloader's, so presumably everything will be pinned.
If 10 TB is unusual today, it won't be in 10 years' time. Also, spare CPU or I/O time isn't always free, particularly in a cloud environment where you pay for what you use, or in more complex topologies involving hybrid semi-cold storage (where access can actually be relatively expensive).
To put it another way, I don't think rehashing is an acceptable solution unless there is absolutely no other choice.

A single core of an Atom C2750, and the C2750 is quite a powerful processor. Consider one of these: a 1 GHz dual-core Cortex-A9 with 2-6 TB of storage, which seems like a very good candidate for IPFS 'seeding'. I'd be surprised if SHA256 ran faster than 50 MB/s total across both cores.
Perhaps you're thinking of users running high-performance 60+ W desktop processors? If that's the case, then yes, CPU is likely not an issue, but on a low-power 2 W CPU, which I feel will become more prevalent in the future, it can be quite different.

Unfortunately, the fragmentation problem I refer to is seen frequently with torrents. A file is shared, then re-shared again and again. Often the original .torrent files are not kept around, so for less frequently accessed content, someone may end up re-creating a .torrent file from the actual content so it can be re-shared.
Torrents are identified by the info hash, but unfortunately this can vary even if the underlying content is identical, for example due to differences in selected piece sizes or the ordering of files. This causes fragmentation in the network, as users attempting to access the content via the older .torrent file may not see peers accessing it via the newly shared .torrent (or torrents, if multiple versions are distributed). If BitTorrent had defined a strict hashing mechanism - say, a fixed chunk size (possibly in a tree fashion) and an exact ordering of files - this problem wouldn't exist, as the same content would always yield the same hash and the network would stay efficient.

I was hoping that IPFS would address this issue, and the way the home page is presented led me to believe it did, but thanks to the explanations here, this clearly does not seem to be the aim.

Actually, everyone uses the default, SHA256.

I'm not a developer, but I'm guessing the intent was something along the lines of "let's change it while the network is still young".

Look under "A Chunked File". You can also use ipfs object get --enc protobuf /ipfs/Qm... to look at real life examples, both of directories and files.

If you're rehashing large amounts of content, it's enough that most people rehash their stores; I'm talking about it from a network perspective. From an uploader's perspective, IPFS will continue to support legacy hashes - you can even use SHA-1 right now, even though it was never used in IPFS.
A datastore of 10 TB is a lot. It would either require a very fast internet connection, or a lot of the content would be garbage collected, shrinking it. If you want to archive files, you shouldn't add and pin them; you should use the file store. That way you don't need to store duplicate copies.

Indeed, or laptop processors. As long as they make up a large part of the network, it would be a feasible way to migrate all hashes without manual intervention.

IPFS does, in practice. SHA256 + 256kb blocks is the standard, and it's unlikely to change. If it does, it will be due to a network-level change. Users changing their hashing settings on a large scale seems implausible.

This isn't a big problem. Changing file order or filenames will change the hash of the folder they're inside, but it won't change the file's hash. Appending a byte to the end of a file will change the file's hash, but it will only change the last chunk's hash. Files and folders are just metadata; IPFS operates on chunks.
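
A toy sketch of that chunk-level behaviour with fixed-size chunking (plain Python, not IPFS itself): append one byte to a file that isn't chunk-aligned and only the final chunk digest changes, so everything before it deduplicates.

```python
import hashlib

CHUNK_SIZE = 256 * 1024  # default chunk size

def chunk_hashes(data: bytes) -> list:
    """sha-256 digest of every fixed-size chunk of the data."""
    return [hashlib.sha256(data[i:i + CHUNK_SIZE]).hexdigest()
            for i in range(0, len(data), CHUNK_SIZE)]

original = bytes(2 * CHUNK_SIZE + 1000)   # toy file: two full chunks plus a partial one
appended = original + b"!"                # append a single byte

a, b = chunk_hashes(original), chunk_hashes(appended)
unchanged = sum(x == y for x, y in zip(a, b))
print(f"{unchanged} of {len(a)} chunks unchanged, {len(b) - unchanged} changed")
# prints: 2 of 3 chunks unchanged, 1 changed
```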

It is an aim, and IPFS does it much better than BitTorrent. It's just not all the way there yet.

That makes a lot of sense. I'm just wondering if there is a particular point where the network isn't considered "young" and stability is a higher goal?

Thanks for the pointers, but there doesn't seem to be much in the way of documentation, unfortunately.

This is the idea. Unfortunately, it doesn't remove the need to rehash if hashes change.

I was actually referring to torrents there, not IPFS. Apologies if I wasn't clear about that.
IPFS seems to solve the issues that torrents have, but it introduces issues of its own in relation to unstable hashes. I was just demonstrating a practical example of the problems with flexible hashing schemes.

After the switch is done, or when IPFS gets a large userbase.

The IPLD documentation shows the exact structure and encoding used. You could always generate the chunk hashes, hex-encode them, and then look at the IPFS protobuf to see where they end up. Or look at the relevant code in go-ipfs.

You'll still be able to use the old hash format for downloading; it's just not going to be recommended when you add a new file.

Alright, thank you very much for sticking with me and providing all that info es_00788224! :slight_smile:

@Nyan I'm wondering if you ever found a solution to compute an IPFS hash for a file and have it match what is output from ipfs add?

Unfortunately not, but maybe there's something better out there now - I haven't checked, however.

Okay, thanks for the response. It's unfortunate that this process isn't documented, as validating IPFS hashes without having to upload a file could be useful for many applications. I was able to find logic in this library which I believe handles the composition of the CIDs (https://github.com/ipfs/js-ipfs-unixfs-engine/blob/master/src/builder/builder.js), but I still need to dig in a little more.
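
In case it helps anyone else, here's a rough Python sketch of what I think the simplest case boils down to, based on my reading of the unixfs and dag-pb protobuf definitions: a file that fits in a single chunk, added with the old defaults (sha-256, 256 KiB chunker, CIDv0, no raw leaves). The field numbers and ordering are my own reconstruction, so verify the output against ipfs add before relying on it; multi-chunk files and directories add Links and blocksizes, which is where the builder.js logic above comes in.

```python
import hashlib

B58 = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def varint(n: int) -> bytes:
    # unsigned varint as used by protobuf for tags, lengths and integers
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        out.append(byte | (0x80 if n else 0))
        if not n:
            return bytes(out)

def base58btc(data: bytes) -> str:
    n, s = int.from_bytes(data, "big"), ""
    while n:
        n, r = divmod(n, 58)
        s = B58[r] + s
    return "1" * (len(data) - len(data.lstrip(b"\x00"))) + s

def cid_v0_single_chunk(content: bytes) -> str:
    # unixfs Data message: Type=File(2), Data=<content>, filesize=len(content)
    unixfs = (b"\x08\x02"
              + b"\x12" + varint(len(content)) + content
              + b"\x18" + varint(len(content)))
    # dag-pb PBNode with no Links, just the Data field (field 1, length-delimited)
    pbnode = b"\x0a" + varint(len(unixfs)) + unixfs
    # CIDv0 string = base58btc(0x12 0x20 || sha2-256 digest of the node bytes)
    return base58btc(b"\x12\x20" + hashlib.sha256(pbnode).digest())

print(cid_v0_single_chunk(b"hello world\n"))
```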

@alexander Were you able to find a solution?

@Wilhelm have a look at this thread. I was given some advice in the comments that might be helpful to you too: https://github.com/ipfs/js-ipfs/issues/1205
