I only mentioned BitTorrent as an example of a decentralized system that follows the upload-once, download-many-times workflow. Unless we're not considering BitTorrent to be decentralized, I was trying to understand how "having only one person add a file to IPFS and then distribute the hash seems counter-intuitive to the notion of decentralization". Anybody in the swarm who has/requests an IPFS hash can become a provider of the content it addresses.
I'm not aware of a single place that documents these all thoroughly, but this would be helpful. I usually use this command reference to try to find existing commands that might do what I'm looking for, but in terms of completeness I think it's the same as the help text output.
I'm not sure what that comment author's experience includes, but based on this discussion around the implementation of the trickle-dag format, there are some tangible differences: https://github.com/ipfs/go-ipfs/pull/713
With BitTorrent, the system relies on one point of centralization: the indexer. Once the index goes down, no one can find the content. Of course, if some have already downloaded the content, they can recreate torrent files and share them with those around them; however, BitTorrent doesn't have a strictly defined hash, so the generated torrents may be different, and thus the small torrent swarms may never be able to discover each other.
IPFS is probably better in this regard: if the hash settings aren't changed, the hashes should usually be the same, but it doesn't seem to solve the underlying problem.
Unfortunately, it gives very little information on what trickle does. I haven't read through your other link yet, but thank you very much for the information.
Thank you for the link. It seems like the aim of the project is different to what I had imagined. I suppose the following quote expresses my thinking at the moment:
In theory, the same file can have multiple hashes. In practice, almost everyone uses the default settings.
Are they any good? How do they solve the spam problem?
Why can't this be agreed on by some consensus, instead of letting every client do as it pleases? At the very least, using anything other than the standard function should be strongly discouraged.
Currently, by default, the hash of a block is the multihash of the sha256 of the block's data. For example: split file.bin; sha256sum x*. The hash of a file is the hash of the canonicalized IPLD object containing the chunks: https://github.com/ipld/specs/tree/master/ipld
Just hash them again. This is a relatively small project; everyone is using go-ipfs right now. If you want to switch over to blake2b, just add a line of code in the next update that re-hashes all sha256 chunks in the datastore as blake2b the next time the daemon is restarted. Then the old sha256 chunks are changed to point to the blake2b chunks until they get garbage collected. You could use both hashes at the same time during a transition period, during which anything in the datastore is made available under both hashes, but anything newly added will by default be output by the GUI as blake2b. Then, when nobody is distributing sha256 hashes anymore, you just remove the dual hashing in the background.
From what I've seen, it appears to be more complex than your command seems to indicate, including a protobuf wrapper and hashes organised in tree form?
The downside here is that it requires the file to be in the file store of the one making this change, and requires an expensive rehash operation. I can't see many people willingly doing this, unfortunately.
I was originally considering the idea of hashing a file but not "seeding" it, while nonetheless distributing the hash. The idea being that if someone else decided to actually add the file to IPFS, the hash would become valid and the content discoverable by everyone. Unfortunately, it appears that stability in the hashing system isn't really a goal, so I guess IPFS doesn't really suit my use case.
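For what it's worth, the "hash but don't seed" half of that idea does seem doable today with add's only-hash flag, if I understand it correctly; it's the hash stability that's the sticking point:

    ipfs add --only-hash file.bin   # prints the hash without storing or providing the content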
Thanks again for all the info given.
Yes, and I agree it's a terrible "feature", but in practice it's rare for people to mess around with the settings.
I mean the indexing. There have been similar attempts before, and they've mostly been rubbish.
No, the hash of a block is just the sha256 of it. The hash of a file is a different animal, see above.
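If you want to check that yourself, something along these lines should do it (QmExampleBlock is just a placeholder for a real block hash; base58-decoding a v0 hash like that gives the two multihash prefix bytes 0x12 0x20 followed by that same sha256 digest):

    ipfs block get QmExampleBlock | sha256sum   # digest should match the one embedded in the multihash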
The default size of the file store is 10 GB. You only need to rehash the files once each time you change the protocol. A rehash operation is I/O bottlenecked. An HDD reads at about 150 Mbit/s, so it would take about 9 minutes to rehash a full default-size file store. This only needs to be done once every few years. Wikipedia suggests your average hash function has a shelf life of around 10 years. Since files are read sooner or later anyway, you could just re-hash them whenever you get around to reading them. If they're not accessed, the garbage collector will get them, so whatever you do, the old hashes will slowly be purged from the network. You don't need the entire file; it can be done separately for each block.
It's still in development. But you could just stipulate that they use the default hash options; you have to go out of your way to change them, and it's rare for people to do so.
I agree with you; it's not a good idea to allow people to change the hash options willy-nilly just because of "freedom". But it's no big problem in practice.
I think you misunderstand. I meant that the application's default settings can change. For example, BLAKE2b is being considered as a new default instead of SHA256, without any manual intervention from the user (other than updating the software). It's unknown whether the other options will get new defaults in the future, but I haven't heard any statement that they're expected to change only rarely.
Oh I see. But yeah, the hash of the file is what I'm after.
Try replacing 10GB with 10TB, unless you're advocating that no one would change such a setting? (Though I'd argue that even 10TB is small if you're looking a few years into the future.)
It does somewhat depend on the hash and CPU, as well as whether a low-power disk is being used. For large amounts of storage, it doesn't seem unlikely for a low-power CPU (say ARM) and a low-power, large-capacity disk to be used. I get around 82MB/s for SHA256 on a 2.4GHz Silvermont Atom CPU here, which is definitely slower than the disk throughput rate. BLAKE2b is likely faster, but I don't have anything to test it with at the moment.
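If you want a rough like-for-like comparison on one box, newer GNU coreutils ship a b2sum tool alongside sha256sum; a quick, hand-wavy benchmark (this only compares the raw hash functions, it says nothing about what ipfs itself would achieve):

    dd if=/dev/zero of=bench.bin bs=1M count=1024   # 1 GiB of test data
    time sha256sum bench.bin
    time b2sum bench.bin                            # BLAKE2b, coreutils 8.26 or later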
The concern I have is that SHA256 isn't really out of its "shelf life", that is, it's not known to be weak, and yet it's being replaced.
I suspect this may not be the best solution for data you want archived.
As well as a particular client and client version?
They're not going to be changed frivolously, though; that goes against common sense. So you can hash them with both SHA256 and BLAKE2b for long-term archival. They're not going to remove support for SHA256. You can specify any supported hash function as the parameter for --hash.
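Roughly like this, assuming your build includes the blake2b-256 multihash (as far as I know, the non-default hash gets printed in the newer CID format rather than the usual Qm... form):

    ipfs add file.bin                       # default sha2-256 hash
    ipfs add --hash=blake2b-256 file.bin    # a second, BLAKE2b-based hash for the same content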
Check the IPLD spec under Canonical Format, A Chunked File, and A Directory.
I'm discussing a mechanism for transitioning between hash types, performed on a network scale. Most people wouldn't have 10TB file stores. If someone does have a 10TB file store, it can be done whenever there are spare CPU cycles and disk I/O, or whenever a file is read for other reasons. A 10TB file store would be unlikely to be filled, though, since you only download content that you want to download, and inactive content that you haven't pinned gets automatically deleted.
A regular desktop HDD can read at 170MB/s at most. Your 82MB/s figure, is that across all cores or single-threaded? The average node's CPU is faster at hashing than its HDD is at reading, and I think there are very few nodes that are slower at hashing. For nodes that are faster at reading, this optimization won't give them anything, but they already have to do the hashing once when downloading to check data integrity. It can be done when the CPU is at low utilization, or spread over a longer period of time.
If you want to archive data, you pin it; pinned data is never garbage collected. If you want to archive data, you probably also use the filestore feature to link to existing files on your HDD that aren't in the repo. Sooner or later, someone will access it, and then it can have its hash recalculated.
Fair point. "Use the default settings in 0.4.9 (256kb chunks and sha-256)"
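Or spell those defaults out explicitly, so the stipulation stays meaningful even if the defaults later move; something like this should be equivalent to a plain ipfs add today, assuming your build exposes the chunker option:

    ipfs add --chunker=size-262144 --hash=sha2-256 file.bin   # 256 KiB chunks, sha2-256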
May I ask, what are you archiving? You can add content to IPFS and seed it without copying it to the internal IPFS data store.
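That's the experimental filestore; a minimal sketch, assuming a reasonably recent go-ipfs build (restart the daemon after flipping the config option):

    ipfs config --json Experimental.FilestoreEnabled true
    ipfs add --nocopy -r /path/to/archive   # references the files in place instead of copying them into the repo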
Even if the default hash algorithm is changed, you're still going to be able to download files created with the old one; creating new ones with it will just be discouraged. I can even add files using sha-1 right now. There's no reason to do so, but I can, and people can download those hashes without any special settings in their client. So that's what might happen to SHA256 in a few years.
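For example (assuming the sha1 multihash is enabled in your build):

    ipfs add --hash=sha1 file.bin   # legacy, non-default hash; downloaders need no special settings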
How long is "long term" though? I'm concerned because changing SHA256 to BLAKE2b does seem like a fairly frivolous change: SHA256 isn't broken or considered weak by any standard. The only reason I could gather was mostly a speed increase; nice, but not something I'd like to see in a stable network.
Thank you very much for the info, though it seems rather bare. For example, I can't seem to find much about how file chunks are wrapped in protobufs.
I'm hoping that the design is intended for more than just "most people", unless the original intent was never really to support such usage. Note that I'm looking at this from an uploader's perspective, not a downloader's, so presumably everything will be pinned.
If 10TB is unusual today, it won't be in 10 years' time. Also, spare CPU or I/O time isn't always free, particularly in a cloud environment where you pay for what you use, or in more complex topologies that involve hybrid semi-cold storage (where access can actually be relatively expensive).
To put it another way, I don't think rehashing is always an acceptable solution, unless there is absolutely no other choice.
A single core of an Atom C2750, but the C2750 is quite a powerful processor. Consider one of these: a 1GHz Cortex-A9 dual core with 2-6TB of storage, which seems like a very good candidate for IPFS "seeding". I'd be surprised if SHA256 ran faster than 50MB/s in total across both cores.
Perhaps you're thinking of users running high-performance 60+W desktop processors? If that's the case, then yes, the CPU is likely not an issue, but on a low-power 2W CPU, which I feel will become more prevalent in the future, it can be quite different.
Unfortunately, the fragmentation problem I refer to is seen frequently with torrents. A file is shared, then re-shared again and again. Often the .torrent files are not kept, so for less frequently accessed content, someone may have to re-create a .torrent file from the actual content in order to re-share it.
Torrent files are identified by the info hash, but unfortunately this can vary even if the underlying content is identical, for example due to differences in the selected piece size or the ordering of files. This causes fragmentation in the network, as users attempting to access the content via the older .torrent file may not see peers accessing it via the newly shared .torrent (or torrents, if multiple versions are distributed). If BitTorrent had defined a strict hashing mechanism, say a fixed chunk size (possibly in a tree fashion) and an exact ordering of files, this problem wouldn't exist: the same content would always yield the same hash, and the network would stay efficient.
I was hoping that IPFS would address this issue, and the way the home page is presented led me to believe it does, but thanks to the explanations here, this clearly does not seem to be the aim.
I'm not a developer, but I'm guessing the intent was something along the lines of "let's change it while the network is still young".
Look under "A Chunked File". You can also use ipfs object get --enc protobuf /ipfs/Qm... to look at real-life examples, both of directories and files.
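A fuller walkthrough of the same idea (QmExampleFile stands in for a real file hash):

    ipfs object links QmExampleFile                 # the chunk links: hash and size per chunk
    ipfs object get QmExampleFile                   # JSON view of the root node (its Links plus a Data field)
    ipfs object get --enc protobuf QmExampleFile    # the raw dag-pb encoding of that node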
If you're rehashing large amounts of content, it's enough that most people rehash their stores; I'm talking about it from a network perspective. From an uploader's perspective, IPFS will continue to support legacy hashes; you can even use SHA-1 right now, even though it was never used in IPFS.
A datastore of 10TB is a lot. It would either require a very fast internet connection, or a lot of the content would be garbage collected, thus shrinking it. If you want to archive files, you shouldn't add and pin them; you should use the file store. That way you don't need to store duplicate copies.
Indeed, or laptop processors. As long as they make up a large part of the network, it would be a feasible way to migrate all hashes without manual intervention.
IPFS does, in practice. SHA256 + 256kb blocks is the standard, and it's unlikely to change. If it does, it will be due to a network-level change. Users changing their hashing settings on a large scale seems implausible.
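You can check that determinism for yourself; with the defaults untouched, the same bytes always produce the same hash (using only-hash, so nothing is actually stored or announced):

    ipfs add --only-hash file.bin
    cp file.bin elsewhere.bin
    ipfs add --only-hash elsewhere.bin   # prints the same hash, since the content and settings are identical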
This isn't a big problem. Changing the file order or filenames will change the hash of the folder they're inside, but it won't change the file's hash. Appending a byte to the end of a file will change the file's hash, but it will only change the last chunk's hash. Files and folders are just metadata; IPFS operates on chunks.
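A sketch of that, with QmBefore and QmAfter standing in for the root hashes that ipfs add prints:

    ipfs add big.bin                          # note the printed root hash, "QmBefore"
    ipfs object links QmBefore > before.txt
    printf 'x' >> big.bin
    ipfs add big.bin                          # new root hash, "QmAfter"
    ipfs object links QmAfter > after.txt
    diff before.txt after.txt                 # only the tail chunk's link should differ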
It is an aim, and IPFS does it much better than BitTorrent. It's just not all the way there yet.
That makes a lot of sense. I'm just wondering whether there is a particular point at which the network stops being considered "young" and stability becomes a higher goal?
Thanks for the pointers, but there doesn't seem to be much in the way of documentation, unfortunately.
This is the idea. Unfortunately, it doesn't remove the need to rehash if the hashes change.
I was actually referring to torrents there, not IPFS. Apologies if I wasn't clear about that.
IPFS seems to solve the issues that torrents have, but it introduces issues of its own, in relation to unstable hashes. I was just demonstrating a practical example of the problems with flexible hashing schemes.
After the switch is done, or when IPFS gets a large userbase.
The IPLD documentation shows the exact structure and encoding used. You could always generate the chunk hashes, hex-encode them, and then look at the IPFS protobuf to see where they end up. Or look at the relevant code in go-ipfs.
You'll still be able to use the old hash format for downloading; it's just not going to be recommended when you add a new file.
Okay, thanks for the response. It is unfortunate that this process isn't documented, as validating IPFS hashes without having to upload a file could, I think, be useful for many applications. I was able to find logic in this library which I believe handles the composition of the CIDs (https://github.com/ipfs/js-ipfs-unixfs-engine/blob/master/src/builder/builder.js), but I still need to dig in a little more.