How can I search for a file in IPFS without knowing its QmHash?

I use a website’s service to save some public plain-text (HTML) content. I chose this website because they claimed to save all content in IPFS, and they gave me the QmHash for each piece of content I shared there. It served me well until recently, when the website released a new software version with a bug that replaces the original content with a truncated file. The original content was shared on 2020-11-18 and should still be in the IPFS storage they have pinned (they claim to pin all IPFS content and never delete any of it). The truncation happened the next day. The website now only gives me the QmHash of the revised (truncated) version, which contains only part of the original content I shared.

If I had access to the local disk of that website’s IPFS server, I could do a full-disk search with Unix tools like grep for some unique keywords from the content I shared, but the website is not willing to give me access to their server.

I have tried the public search at https://ipfs-search.com/#/search, but its search quality is very poor. Much of my publicly shared content has existed for months or years and can be accessed by QmHash through any IPFS gateway, yet when I search ipfs-search by title or by any keyword, it almost always returns “Nothing found”.

Is there any other way to find the lost QmHash using part of the content (taken from the truncated version), so that I can get the first version of my content back? I am sure it still exists in IPFS; I have just lost its QmHash.

I also know that the original revision was put on IPFS on 2020-11-18, and I can narrow that down to a particular hour. Does this help to find it?
It is entirely plain text in HTML format, and I have almost the first half of the content.

Hello,

As you mention, if you had access to the IPFS repository (via the HTTP API) where this content was stored, it would be pretty trivial to find the CID, but you don’t have access.

If this content is not indexed by any search engine (like ipfs-search, but there are others), then I’m afraid you’re going to have a hard time finding that file again. But if you know their IPFS node’s PeerID, and depending on how their node is configured, you might have a small chance …

IPFS nodes periodically announce to the network which CIDs they “provide” (objects that they have stored in the repo and can deliver to other nodes). If you know the Peer ID of this node, and listen for these “provide” messages (with a simple script), you’ll have access to the CIDs this node provides. Then, what would be wise is to first perform an object stat on each CID (to know the size of the object … if the size is much larger than what this file should be then ignore it), then fetch the file’s content from its CID. Now you can perform a search on the file’s contents, and if it matches you now have the CID …
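The size check and content search described above could be sketched roughly as follows in Python. This is only an illustration, not a tested tool. It assumes a local IPFS node with its HTTP RPC API on the default 127.0.0.1:5001, that the object/stat and cat endpoints are available on that node (newer releases may have deprecated object stat), and that the candidate CIDs have already been collected somehow. The names find_matches, stat_size and fetch are made up for this example.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Iterable, Iterator

# Assumption: a local node exposing its RPC API on the default address.
API = "http://127.0.0.1:5001/api/v0"


def _rpc(path: str, cid: str, timeout: float = 30.0) -> bytes:
    """POST to the node's HTTP RPC API and return the raw response body."""
    url = f"{API}/{path}?arg={urllib.parse.quote(cid)}"
    req = urllib.request.Request(url, method="POST")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()


def stat_size(cid: str) -> int:
    """Cumulative size in bytes, as reported by `ipfs object stat`."""
    return int(json.loads(_rpc("object/stat", cid))["CumulativeSize"])


def fetch(cid: str) -> bytes:
    """File content, as returned by `ipfs cat`."""
    return _rpc("cat", cid)


def find_matches(
    cids: Iterable[str],
    snippet: str,
    max_size: int,
    stat: Callable[[str], int] = stat_size,
    cat: Callable[[str], bytes] = fetch,
) -> Iterator[str]:
    """Yield CIDs that are small enough and whose content contains `snippet`."""
    needle = snippet.encode("utf-8")
    for cid in cids:
        try:
            if stat(cid) > max_size:
                continue  # far too large to be the article, skip the download
            if needle in cat(cid):
                yield cid
        except (OSError, ValueError, KeyError):
            continue  # unreachable or unparsable object, move on
```

Called as list(find_matches(candidate_cids, "some unique sentence", 200_000)), it would return the CIDs whose content contains that sentence. The stat and cat parameters are injectable so the logic can be tried against fake data before pointing it at a real node.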

Now, waiting for this node to announce the CID you’re waiting for could take some time (depending on their reprovider strategy; by default it’s every 12 hours). What you can do is first write the script and check that it works (filtering on another PeerID, that of a node you control). Then, once you’re ready, politely ask the people running the website to trigger their reprovider by running ipfs bitswap reprovide. If they don’t, then it’ll take more time.

Thanks @reload for all of that. I am starting to look at what the IPFS API offers, and I wonder whether it has:

  1. For a given CID, a way to look up which node is pinning it?
    (Because this website’s IPFS server keeps publishing new content, I can collect a list of very fresh CIDs from the site. Is there an API to do a reverse lookup of which IPFS server is offering these new CIDs, so I can identify one or more of their IPFS servers?)
  2. For a given IPFS server, a way to list all the CIDs it pins in descending timestamp order (with more metadata such as file size, etc.)?

Thanks in advance

To look up which IPFS node provides a particular CID, use the “find providers” command: ipfs dht findprovs
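For the idea of taking several fresh CIDs from the site and working out which node(s) serve them, a rough Python sketch could run ipfs dht findprovs on each CID and count how often each peer ID shows up. It assumes the ipfs binary is on the PATH with a running daemon, and that the command prints one peer ID per line. Newer Kubo releases expose the same lookup as ipfs routing findprovs. The function names are invented for this example.

```python
import subprocess
from collections import Counter
from typing import Callable, Iterable, List, Set, Tuple


def find_providers(cid: str, timeout: float = 120.0) -> Set[str]:
    """Return the peer IDs printed by `ipfs dht findprovs <cid>`."""
    try:
        result = subprocess.run(
            ["ipfs", "dht", "findprovs", cid],
            capture_output=True,
            text=True,
            timeout=timeout,
            check=False,
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return set()  # lookup took too long, or ipfs is not installed
    return {line.strip() for line in result.stdout.splitlines() if line.strip()}


def rank_providers(
    cids: Iterable[str],
    lookup: Callable[[str], Set[str]] = find_providers,
) -> List[Tuple[str, int]]:
    """Count how many of the given CIDs each peer provides, most frequent first."""
    counts: Counter = Counter()
    for cid in cids:
        counts.update(lookup(cid))
    return counts.most_common()
```

A peer that appears for nearly every fresh CID from the site is a plausible candidate for the website’s own node, although gateways and other nodes that happened to fetch the content can show up as providers too.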

Regarding your second question: I’m not aware of a simple solution to achieve this; maybe someone else will answer this one. One thing is sure: you won’t get timestamps. AFAIK, IPFS doesn’t store time metadata for the objects in the repository.

Hey @dve2j9wkd,

Yes, you can probably achieve 2), but not synchronously and not reliably (it depends on how the node is configured). By default, a go-ipfs node is configured with a Reprovider strategy that announces all CIDs it can provide (with the default “all” strategy, that is everything stored in its repo, not only pinned data) every 12 hours; see the Reprovider documentation.

If this node uses the default configuration, then it will publish its records. Just leave your script running for a few days.

Thanks. Looping over all 50,000 shared articles and checking them one by one took a few hours, but it finished the task and I found the one I wanted.
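For anyone attempting a similar one-by-one check, here is a hedged sketch of how such a loop might look in Python. It is not the script that was actually used. It assumes the candidate CIDs are already known, fetches each one through a public gateway (ipfs.io is just an example; any gateway should do), and treats a candidate as the original if it contains a known fragment and is longer than the truncated copy. All names are hypothetical.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Optional

# Assumption: any public gateway works; ipfs.io is only an example.
GATEWAY = "https://ipfs.io/ipfs/"


def fetch_via_gateway(cid: str, timeout: float = 30.0) -> bytes:
    """Download the content behind a CID through an HTTP gateway."""
    with urllib.request.urlopen(GATEWAY + cid, timeout=timeout) as resp:
        return resp.read()


def is_original(body: bytes, known_fragment: bytes, truncated_len: int) -> bool:
    """True if `body` contains the known text and is longer than the truncated copy."""
    return known_fragment in body and len(body) > truncated_len


def scan(
    cids: Iterable[str],
    known_fragment: bytes,
    truncated_len: int,
    fetch: Callable[[str], bytes] = fetch_via_gateway,
    workers: int = 16,
) -> List[str]:
    """Check candidates in parallel and return those that look like the original."""

    def check(cid: str) -> Optional[str]:
        try:
            body = fetch(cid)
        except OSError:
            return None  # gateway error or timeout, skip this candidate
        return cid if is_original(body, known_fragment, truncated_len) else None

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return [cid for cid in pool.map(check, cids) if cid is not None]
```

Running the downloads in a small thread pool is a design choice to cut the wall-clock time of a scan over tens of thousands of CIDs; the worker count should stay modest to avoid hammering a public gateway.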

Thanks again for all the advice.

You’re welcome. Nice one! Obsession rewards.

@reload I am interested in such a script for some tests I want to run. I guess this is directly at the libp2p level, right? I am not really familiar with the library. Can I find the basics for writing such a script somewhere in the docs, or does one already exist?
Thanks