File Format Identification Post Generation of HASH

Iyengar · April 28, 2020, 9:46pm

A file or other content added to a private IPFS network can be in any file format, and adding it produces a hash that can be viewed at, for example, http://127.0.0.1:8080/ipfs/QmNUqmqNH953LHPHu1QqfuuR55GMcR2EcnxvhVeKPg6CZd. The hash in this example is for MP4 video content.
How can one tell which file format a generated hash corresponds to?

This is a major requirement for law enforcement agencies: they would certainly want to block or effectively monitor objectionable content that could be camouflaged behind a hash and easily transmitted. Any thoughts on this?

Akita · April 28, 2020, 10:38pm

You can’t deduce the format from the hash.
But you couldn’t deduce the format from the file name or the extension anyway. I can shoot a movie, rename it ThisIsATextISwearSirNothingFishyHere.txt, and you will still be able to open it with VLC and enjoy the movie. In the end, the “IPFS hash”, or CID, maps to a string of bytes. It’s up to the application to decide what it is and how to read it.

Iyengar · April 29, 2020, 5:01am

That is a valid point. Thank you

hector · April 29, 2020, 9:25am

Technically (this is going to be a bit nit-picky)…

The CID (or hash) identifies the type of data that it is pointing to (it includes a multicodec codec ID).

However, when you add an mp3 to IPFS, the codec will not be “mp3” but “unixfs”. That is because you are not referencing the mp3 file itself. You are referencing an IPFS-specific file type that contains the different blocks and the Merkle-DAG structure that make up the mp3 file (and allow it to be moved around the network). So the mp3 file is contained inside this “unixfs” DAG, and the CIDs point to unixfs blocks.
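To make the codec point concrete, here is a minimal Python sketch that reads the codec out of a CID string. It is an illustration only, not how any IPFS implementation does it. The helper names (read_varint, cid_codec) and the three-entry codec table are made up for this example, and it only handles CIDv0 and base32-encoded CIDv1. Note that in the multicodec table the code these file CIDs carry is dag-pb (0x70), the protobuf node format that the unixfs data is wrapped in, and a CIDv0 such as the one in the first post has no explicit codec field at all, so dag-pb is implied.

```python
import base64
from typing import Tuple

# Small subset of the multicodec table (codes from the multiformats project).
CODEC_NAMES = {0x55: "raw", 0x70: "dag-pb", 0x71: "dag-cbor"}


def read_varint(data: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode one unsigned varint; return (value, position after it)."""
    value = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:  # high bit clear: this was the last byte
            return value, pos
        shift += 7


def cid_codec(cid: str) -> str:
    """Return the name of the content codec carried by a CID string."""
    # CIDv0 is a bare base58btc sha2-256 multihash: 46 characters starting
    # with "Qm". It has no codec field; the codec is implicitly dag-pb.
    if len(cid) == 46 and cid.startswith("Qm"):
        return "dag-pb"
    # CIDv1 in base32: multibase prefix "b", then lowercase base32 without
    # padding. The decoded bytes are <version><codec><multihash>.
    if cid.startswith("b"):
        body = cid[1:].upper()
        body += "=" * (-len(body) % 8)  # restore the stripped padding
        raw = base64.b32decode(body)
        version, pos = read_varint(raw)
        if version != 1:
            raise ValueError(f"unsupported CID version {version}")
        code, _ = read_varint(raw, pos)
        return CODEC_NAMES.get(code, hex(code))
    raise ValueError("only CIDv0 and base32 CIDv1 are handled in this sketch")
```

Calling cid_codec("QmNUqmqNH953LHPHu1QqfuuR55GMcR2EcnxvhVeKPg6CZd") returns "dag-pb", which says nothing about the content being MP4. That is the point: the codec describes the IPFS block, not the media inside it.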

Codecs are defined in the multiformats project. There is talk about adding codecs for MIME types (https://github.com/multiformats/multicodec/issues/4) because theoretically you could use a CID to reference whatever type directly.

In practice, in IPFS land, things need to be split into small blocks though, so effectively the vast majority of CIDs you will see will be “unixfs”. When you request the file from ipfs, however, ipfs transparently puts it all together for you so that you get the original content.

So in the end, in order to know the file type of something you will likely need to download the content and check. For many formats, downloading the first few bytes (the first unixfs block in our case) suffices to make a pretty good guess. This “guessing” (along with the extension when possible) is used by the IPFS gateways to tell your browser about file types when you are using them, thus allowing the browser to show an image or play a video directly, rather than offering to download a file of opaque type.
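As a rough illustration of that kind of guessing, the Python sketch below matches the first bytes of the content against a handful of well-known magic numbers. It is not the detection logic the gateways actually use. The function names (sniff_format, fetch_head) and the short signature list are made up for this example, and fetch_head assumes a local gateway at http://127.0.0.1:8080 that honors HTTP Range requests.

```python
import urllib.request

# A few well-known magic numbers: (offset, signature, guessed type).
SIGNATURES = [
    (0, b"\x89PNG\r\n\x1a\n", "image/png"),
    (0, b"\xff\xd8\xff", "image/jpeg"),
    (0, b"GIF8", "image/gif"),
    (0, b"%PDF-", "application/pdf"),
    (0, b"ID3", "audio/mpeg"),  # mp3 that starts with an ID3v2 tag
    (4, b"ftyp", "video/mp4"),  # ISO media family; treated as MP4 here
]


def sniff_format(head: bytes) -> str:
    """Guess a content type from the leading bytes of a file."""
    for offset, magic, label in SIGNATURES:
        if head[offset:offset + len(magic)] == magic:
            return label
    return "application/octet-stream"  # unknown: opaque bytes


def fetch_head(cid: str, gateway: str = "http://127.0.0.1:8080",
               length: int = 512) -> bytes:
    """Fetch only the first `length` bytes of a CID through a gateway.

    Assumes the gateway honors the Range header; if it does not, the
    read below still stops after `length` bytes.
    """
    request = urllib.request.Request(
        f"{gateway}/ipfs/{cid}",
        headers={"Range": f"bytes=0-{length - 1}"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read(length)
```

With a node running, sniff_format(fetch_head("QmNUqmqNH953LHPHu1QqfuuR55GMcR2EcnxvhVeKPg6CZd")) would make a guess for the CID from the first post. A match only means the bytes look like that format; anyone can prepend a misleading header, so this is a hint rather than proof.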

Hope this clarifies things in a bit more detail.

Iyengar · April 29, 2020, 7:18pm

Yes Hector

It clarifies. Thank you very much for the detailed explanation.

Regards

Ashok Iyengar
