Application of CID hash for verifying documents

I’m a newbie, so forgive me if this question is irrelevant. I have an idea for something I can use IPFS for.

The company I work for creates documents that bad guys might want to alter and pass off as authentic. So I was thinking we could have a file authentication tool on our website that would verify whether the file in question was an authentic, unaltered file produced by our company.

I was thinking that we could have a page kind of like the “File CID Verifier” on Pinata ( Pinata | Effortless IPFS File Management ). The person would upload the file to the tool, and the tool would create a hash and search our database of documents we have produced to see whether that hash exists. (The tool only creates a hash; it does not upload the file to the IPFS network.) If it finds that hash in our database, it would tell the visitor that the file is authentic. If the hash doesn’t exist, it would tell them that the file may not be authentic.

A few questions:
1st, is this a good idea? or am I just reinventing the wheel for some solution that is already out there?
2nd, is this doable with IPFS technology? Or should I be looking for some other technology to do this with?
3rd, anybody out there think they can do this? Are you interested in talking about it further?

We can do so much better now.
We live in a world where cryptography exists.

1st, is this a good idea? or am I just reinventing the wheel for some solution that is already out there?

Yes, it is a good idea and it would work. This is actually a very well-known problem.

So your solution is pretty good, but it has two issues: checks can only happen while your service is online, and you as the company can remove entries from the database. In theory, other parties wouldn’t like either of those.
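
As a rough sketch of the database approach (Python, standard library only): `known_hashes` is a hypothetical stand-in for the company’s database and `is_authentic` is a made-up name; a real tool would query an actual database instead of an in-memory set.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the given bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical stand-in for the database of hashes of documents you produced.
known_hashes = {sha256_hex(b"original contract text")}


def is_authentic(uploaded: bytes) -> bool:
    """True if the uploaded file's hash is in the set of known hashes."""
    return sha256_hex(uploaded) in known_hashes
```

Changing even one byte of the file gives a completely different hash, so an altered document is simply not found.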

The other way to do it is with a signature.
Instead of having a database of hashes, your company holds a private key that it uses to encrypt hashes of correct files producing “signatures”.
Then when someone wants to check a file’s authenticity, they take their tool, hash the file, decrypt the signature with your public key, and check that both results match. If they do, that means the file is authentic.
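
To make that flow concrete, here is a toy sketch in Python using textbook RSA with a small fixed key. This is an assumption made purely for illustration: the key is far too small, there is no padding, and the names (`sign`, `verify`, `digest_as_int`) are made up. It is NOT secure and real code should use a vetted library. Real schemes such as Ed25519 do not literally decrypt anything, but the hash, sign, verify flow is the same.

```python
import hashlib

# Toy key built from two well-known Mersenne primes. Illustration only:
# a real key is generated randomly and is much larger.
P = 2**127 - 1
Q = 2**89 - 1
N = P * Q                              # public modulus
E = 65537                              # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))      # private exponent (keep secret)


def digest_as_int(data: bytes) -> int:
    """SHA-256 of the data, reduced to an integer below the toy modulus."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % N


def sign(data: bytes) -> int:
    """Company side: transform the hash with the private key."""
    return pow(digest_as_int(data), D, N)


def verify(data: bytes, signature: int) -> bool:
    """Checker side: undo the signature with the public key, compare hashes."""
    return pow(signature, E, N) == digest_as_int(data)
```

The checker only needs `N` and `E` (the public key), so verification can run offline.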

Other parties usually like that more because they only have to download the public key once from you and can then do all checks offline themselves. That avoids downtime (a simple HTTP server cluster easily gets close to 100% uptime, while a database of hashes is more likely to hit ~99%), and if they have a LOT of files to check, they are not bound by your database speed; they can just use more servers and parallelize. It’s also better privacy-wise: with the database you can know which person is checking which files, while with signatures you only know that a person downloaded your public key once. You don’t know how many files they check, nor which ones.
However, you as the company might dislike signatures, because if you get hacked and the private key is leaked you can’t remove bad signatures. You basically need to start over with a new key pair.

For an example of how signatures work, you can open Index of /debian-cd/current/amd64/iso-cd
There are 3 types of files:

  • *.iso, which contain the actual data.
  • SHA{256,512}SUMS, which contain the hashes of those files.
  • and *.sign, which are signatures of the hash files.
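
As a small sketch of the middle step, the Python below parses a `sha256sum`-style file (one `<hex digest>  <file name>` pair per line) and checks a file against it. The file name and contents are made up, and checking the `.sign` file itself is not shown; that part is normally done with a signature tool such as GnuPG.

```python
import hashlib


def parse_sums(text: str) -> dict[str, str]:
    """Map file name -> hex digest from sha256sum-style lines."""
    sums = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        digest, name = line.split(maxsplit=1)
        # A leading "*" on the name marks binary mode in this format.
        sums[name.lstrip("*")] = digest.lower()
    return sums


def matches(name: str, data: bytes, sums: dict[str, str]) -> bool:
    """True if the data hashes to the digest listed for that file name."""
    return sums.get(name) == hashlib.sha256(data).hexdigest()


# Made-up example: a pretend image and the sums file that lists it.
image = b"pretend this is an ISO image"
sums_text = f"{hashlib.sha256(image).hexdigest()}  example.iso\n"
sums = parse_sums(sums_text)
```

Because only the small sums file is signed, one signature covers every image listed in it.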

2nd, is this doable with IPFS technology?

Yes, you can, but there is no real reason to do so.
IPFS doesn’t just hash files; it does a lot of work to make them shareable on the network.
Basically, IPFS cuts your files into 256 KiB chunks (we call them leaves), then it creates metablocks (we call them “Roots”) that contain the list of all blocks to download. If those roots get above 256 KiB too, it creates roots that list other roots, until you get to a single block listing all the original chunks, and the hash of that block is the CID.
If you only care about the hash part, just hash the file. The chunking part of IPFS is only useful because we are sending chunks over the network: it allows us to download different chunks from multiple nodes at the same time.
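
A very simplified sketch of that chunk-then-root idea, in Python. This is not how a real CID is computed (real IPFS wraps blocks in its own formats and encodes the hash differently); `FANOUT` and `toy_root_hash` are made-up names for illustration. It only shows why the result differs from a plain hash of the file.

```python
import hashlib

CHUNK_SIZE = 256 * 1024  # 256 KiB leaves, as described above
FANOUT = 4               # hypothetical: child hashes per root block in this toy


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def toy_root_hash(data: bytes) -> str:
    """Hash the chunks, then hash groups of hashes until one block remains."""
    # Leaves: one hash per fixed-size chunk (an empty file gets one empty leaf).
    level = [h(data[i:i + CHUNK_SIZE])
             for i in range(0, len(data), CHUNK_SIZE)] or [h(b"")]
    # Roots: each round replaces up to FANOUT child hashes with one parent hash.
    while len(level) > 1:
        level = [h(b"".join(level[i:i + FANOUT]))
                 for i in range(0, len(level), FANOUT)]
    return level[0].hex()
```

For any file larger than one chunk, `toy_root_hash(data)` is different from `hashlib.sha256(data).hexdigest()`, which is why a plain file hash is the simpler choice when nothing is being shared over the network.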

Or should I be looking for some other technology to do this with?

The techs you need are called SHA256 (for hashing) and Ed25519 (for signatures). There are debates about which is the “best” one, but those are very strong bets.
Crypto is hard, but realistically this completely falls under the category of things that are faster to build than to search for something that already does it. (Note that I don’t recommend you implement SHA256 and Ed25519 yourself, but those algorithms are figured out. Just google “Sha256 <your programming language>” and you will likely find the best result in the top 3 Stack Overflow links; same for Ed25519.)

3rd, anybody out there think they can do this? Are you interested in talking about it further?

I could. Basically: hash and sign on a server in your company.
And in the website part: hash, decrypt, and check that they match.
It’s about 1 to 3 days of work for a PoC.
The biggest questions are how your existing software is going to interact with the database or signing servers, and how you are going to protect them from hackers. That will take up to a few weeks or months of work (and you can hire a front-end developer to make it pretty in the meantime).

Thanks so much for the very thoughtful and detailed response. Reading that was an education!

I don’t really understand what you are saying about:

The other way to do it is with a signature. Instead of having a database of hashes, your company holds a private key that it uses to encrypt hashes of correct files producing “signatures”.

I don’t really understand the tech behind, for example, the PDF tools’ digital signatures.

(I’m not asking you to explain your meaning further; I just mean that I need to investigate that and learn more about what you are talking about in that section.)

I understand that you are saying the digital signature with a private and public key is a better solution. On the other hand, I’m still intrigued by the possibilities of the hashing technology.

I’m not too concerned about how much work it would take on our website for the customer-facing part where they verify the document. However, I’m more concerned about how we would implement that on the back end with our project managers. Unless it were highly integrated into our tools and automated, it would just be another manual step to enter the hash into the database. And I’m not much in the mood for more manual work for our project managers.

It sounds like you are saying that digital signatures are likely much better from an implementation cost perspective as well as from a process workflow perspective.

Hmm… I’ll keep this in mind and get back to you about this if we want to discuss with you what you might be able to do regarding implementation.

Thanks a lot