Is there an API method to get the IPFS hash of a file?

Since the CIDs of things stored in IPFS are hashes of their content, is there an API call that will take a file argument and return the CID for it?

If so, that would be an easy way to verify the content was transferred correctly. On rare occasions no errors are produced by “ipfs get”, yet the file is incomplete. If such an API call existed it would be trivial to verify the file was delivered intact.

I’ve looked at the API carefully, but the number of calls is quite large and nothing jumps out at me. It would surprise me if such a function didn’t exist, so perhaps I missed it? Can anyone comment on how to verify, using the API, the integrity of files obtained with ipfs get?

Hashing the file contents is intrinsic to how IPFS functions, so such a function would seem to be trivial to provide.

You will need the DAG to calculate the exact hash. Unless the DAG is returned, you would need to construct the DAG from the raw data. There are several options that affect what the DAG looks like, so in order to get the exact hash those options have to be the same (usually the defaults are used).

To get started, you can probably take a look at the --only-hash option of the add command.
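For example, with a hypothetical file named myfile.bin and default add settings, something like this should print the CID without writing anything to the local blockstore or the network:

$ ipfs add --only-hash myfile.bin
# prints a line like: added <CID> myfile.bin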

Hope this helps!

Yeah, I did use a general Internet search and discovered the --only-hash / -n option of the add command. I thought I’d get a better / faster response here, so I tried here first.

Your answer muddies the waters a bit, though. It sounds like I could get a different hash value for the exact same file, which seems very odd.

I will run some tests to see if that assertion holds true. Since IPFS is based on “content addressability” and the hash (CID) represents that content, it seems very weird to me that the --only-hash option would return a different value for the exact same file, but I will see.

Thanks for the input. If my experiments indicate a difference, I will need to dive deeper to see whether these other considerations you refer to are universally available and constant for all potential recipients of a file being transferred through IPFS.

The test I will run: add a file to IPFS, get it on a different server, verify (using sha512sum) that the two files match, and then use --only-hash on the received copy to see whether the value returned matches the CID I used to get the file out of IPFS.
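Roughly, as a sketch (file names and CID are placeholders; I’m assuming default settings on both nodes):

# on the source server
$ ipfs add somefile               # note the CID it prints
$ sha512sum somefile

# on the destination server
$ ipfs get <CID> -o somefile.copy
$ sha512sum somefile.copy         # compare against the source digest
$ ipfs add -n somefile.copy       # should print the same <CID>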

My tests show the exact same CID is returned from ipfs add -n <file> as was used to request it from IPFS. Granted, there may be variables my testing hasn’t fully taken into account, but the CID is fundamental to IPFS. To ensure files added to it remain accessible into the future, the result of the hashing algorithm used to generate CIDs would need to be rock solid and unchanging.

I tested files of various sizes added to servers over a span of 4 years, and in every case the CID from ipfs add -n <file> was the same. With files that old, who knows which servers are actually the source for the request now, or whether the IPFS version differs from my local node.

The CID in IPFS is not the hash of the file. It’s the hash of the root
block of a DAG representing that file.

When you add a file to IPFS, the file is split into blocks (chunks) and
those blocks are hashed. A DAG is built up out of those blocks. So a
file has very very many possible hashes in IPFS, even if you don’t
account for hashing algorithm.
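To make that concrete, here is a sketch using flags that exist on ipfs add (the exact CIDs you’d get depend on the file, so none are shown):

$ ipfs add -n somefile                          # default chunker, CIDv0
$ ipfs add -n --chunker=size-1048576 somefile   # different chunk size, different DAG
$ ipfs add -n --cid-version=1 somefile          # CIDv1 instead of CIDv0
$ ipfs add -n --raw-leaves somefile             # raw leaf blocks, different block hashes

Each of those can print a different CID for the same bytes.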

Let’s take a quick example. I don’t know how this is going to come out,
since I’m writing it in email.

I have this file stored on my node. The SHA256 hash of the file is:
d088ffe20ea1eb6b51ffaed1833310446812096beb19e2c444147cd3a3cdd77e.
Its CID is: QmSuook5umbYEELYcY9pDmfpmbEwCjNdjutXMi19spYFPg.

$ ipfs dag stat QmSuook5umbYEELYcY9pDmfpmbEwCjNdjutXMi19spYFPg
Size: 52488471, NumBlocks: 204

$ ipfs block stat QmSuook5umbYEELYcY9pDmfpmbEwCjNdjutXMi19spYFPg
Key: QmSuook5umbYEELYcY9pDmfpmbEwCjNdjutXMi19spYFPg
Size: 109

The Qm CID above is like a pointer. It’s the content address of the
root block of the file. That root block is a protobuf, so we can drill
down a little bit and see what’s inside.

$ ipfs block get QmSuook5umbYEELYcY9pDmfpmbEwCjNdjutXMi19spYFPg | protoc --decode_raw

The protoc output is kinda inscrutable, but if you peek at it closely
enough, it describes a block containing two links to additional blocks.
The block also contains a data field, which is basically like a file header,
telling us that this block represents a file and the total file size is
52475881 bytes.
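If protoc isn’t handy, the dag command gives a similar JSON view of the same node (the exact output shape varies a bit between ipfs versions):

$ ipfs dag get QmSuook5umbYEELYcY9pDmfpmbEwCjNdjutXMi19spYFPg
# prints the root node as JSON: a Data field plus a Links array of child CIDs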

Those two additional blocks are going to link to even more blocks.

There isn’t really a concept of “the hash of a file in IPFS”. There is
a hash of the root block made from one particular splitting of a file,
and that’s all.

Thank you teiresias for that awesome reply.

Shortcuts and approximations in our descriptions propagate and lead to inaccurate communication of concepts. I can’t tell you how many times I’ve heard the term “IPFS hash” explained, but never as well as you did. Confusion like mine is probably why CID is the better term for describing a file’s “address”. Hashing is undoubtedly used in many aspects of IPFS, but as you point out it isn’t applied directly to the content as a whole.

If any of the blocks that make up the merkle tree for a file are missing, would that affect the CID returned from ipfs get -n <file> ?

Not knowing the cause of the failures I described in the OP, it’s difficult to say whether any use of the IPFS API could accomplish the intended verification. Of course SHA256 could certainly be used, but my premise was: why bother, if IPFS provides a similar function? It does raise the question of what measures IPFS takes to ensure reliable delivery, and whether the rare failures observed are due to non-IPFS issues.

I assume you meant ipfs add -n. And no, that won’t make a difference,
because ipfs add -n essentially just reads the content from your file
and does the “chunk and hash the blocks” step without reading the
blockstore or trying to fetch blocks from the network. Kind of like a
dry-run upload that gives you a CID back.
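So, as a rough sketch, a post-download check can be a one-line string comparison (QmYourCid and downloaded.file are placeholders; this assumes the original add used default settings and that your ipfs add supports -Q / --quieter to print only the final hash):

$ [ "$(ipfs add -n -Q downloaded.file)" = "QmYourCid" ] && echo intact || echo mismatch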

Going back to your original post, you said:
“On rare occasions no errors are produced by ipfs get yet the file is
incomplete.”

To me, that sounds like a bug in ipfs get. Does it exit zero too?
If it does, then yeah that’s definitely a bug.

Another thing you can use to pull a DAG to your IPFS instance recursively
is the “ipfs dag stat” command.
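For instance (the CID is a placeholder):

$ ipfs dag stat <CID>
# walks the whole DAG under <CID>, fetching any blocks it doesn't have yet.
# If your ipfs version supports the global --offline flag, then
# `ipfs --offline dag stat <CID>` should instead fail if any block is missing locally.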

There’s another API method you might find useful: /api/v0/dag/get. This
will get a DAG node, non-recursively. By default, it returns a JSON
representation of the node with link and data fields. The data field is
base64-encoded bytes. The format of those bytes is dependent upon the
codec of the CID.
The link field is an array of links (CIDs with some extra info).
You can use that JSON output to crawl the DAG. For
example, calling the endpoint with my sample CID
QmSuook5umbYEELYcY9pDmfpmbEwCjNdjutXMi19spYFPg
gives me back the following JSON for the root node:

{"Data":{"/":{"bytes":"CAIY6e+CGSCAgOAVIOnvogM"}},"Links":[{"Hash":{"/":"QmdibA3CZxwQkVFstYGFTY4xBYRRLvbTufHDbTWM5uAVSW"},"Name":"","Tsize":45623854},{"Hash":{"/":"QmZNZaASGSWhQ5E9ACvPLSRrQc5kQizDqr9czhTqHmwq7i"},"Name":"","Tsize":6864508}]}
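Over the HTTP API that endpoint is a POST in recent go-ipfs versions, so a sketch of listing a node’s child CIDs might look like this (assumes a local daemon on the default port 5001 and that jq is installed):

$ curl -s -X POST \
    "http://127.0.0.1:5001/api/v0/dag/get?arg=QmSuook5umbYEELYcY9pDmfpmbEwCjNdjutXMi19spYFPg" \
    | jq -r '.Links[].Hash."/"'
# prints the CIDs of the two child blocks shown above; repeat on each child to crawl the DAG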

I, too, found this stuff confusing until I went into a deep dive. For a
while, I was running a 1+ terabyte package mirror on IPFS, and doing that
efficiently forced me to get a firm grasp on some of the IPFS innards.
The question I kept running into was: it’s content-addressable, so why can’t I fetch a file just by knowing its hash?
Another way to phrase the answer to that question is that IPFS is
content-addressed at the block level, not at the filesystem level, and a
file or directory on IPFS is just a root block of a DAG.

Sorry if I ramble. I’m basically giving a brain dump in hopes that it
will be useful.

No apologies necessary, as someone may find that useful even if I don’t.

There are only so many hours in a day and days in a lifetime, so these days I prefer to only go as deep as necessary to accomplish a task, at least wrt programming tasks.

I can’t provide any details on the rare errors my colleague reported.

I still don’t understand, given that you said:

ipfs add -n essentially just reads the content from your file
and does the “chunk and hash the blocks” step without reading the
blockstore or trying to fetch blocks from the network.

why comparing CIDs isn’t an adequate way to verify the contents, when the CID is based on essentially nothing but the file provided. You said the CID produced by -n isn’t based on info from the network, so if the file is different on the receiving side, that should show up as a different CID. Is it or is it not true that the CID will be different if even just one bit of the file changes?

I created a text file using the nano editor with the single ASCII digit 0 + LF. It produces CID:
QmUQcSjQx2bg4cSe2rUZyQi6F8QtJFJb74fWL7D784UWf9

Use nano to change the 0 to a 1 and it produces CID:
QmdytmR4wULMd3SLo6ePF4s3WcRHWcpnJZ7bHhoj3QB13v

I repeated that on a different system in the same way and got the same results. No file transfer is involved, yet the CID produced is the same. How then could the CID fail to uniquely represent the content it addresses, and fail to be a useful verification method?
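For anyone who wants to reproduce that without an editor, a sketch (assuming default add settings, which is also what you get with nano plus a plain ipfs add -n):

$ printf '0\n' > digit.txt
$ ipfs add -n digit.txt     # prints one CID
$ printf '1\n' > digit.txt
$ ipfs add -n digit.txt     # prints a different CID, since one byte changed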

I may have misunderstood something, but yes, comparing CIDs is a valid verification method.