I am brand new to the world of IPFS, so please forgive me if I misuse any terms or use terms that don't apply to IPFS.
I want to build a tool that makes API requests and stores their responses, kind of like a metadata scraper for an API.
Specifically, I want to make a tool that retrieves .json responses from known addresses of the form ipfs://exampleCID/index
For example:
ipfs://exampleCID/1 (startIndex)
ipfs://exampleCID/2
ipfs://exampleCID/3
…
ipfs://exampleCID/n (endIndex)
The simple solution would just be to fire a bunch of requests from startIndex to endIndex, but the simplest solution is usually not the most efficient one…
So I was wondering… what is the best way to accomplish this? Is there some black-magic batching that lets me download the entire directory at once, or some parallel-request feature in IPFS?
Also, what is the best way to actually send the equivalent of a GET request? Without a specialized client, the only way I can find right now is to send a regular GET request to a gateway of some sort, like ipfs.io/ipfs/
I am planning to use Django for the backend, although if anyone knows a framework that works better with IPFS, or is better suited for this kind of task, please do not hesitate to recommend it.
(Reposting notes from Matrix plus new ones, in case they are useful for other folks trying to build a mental model for using IPFS in their app's infra/architecture.)
If you mean seeking within a single file, ipfs cat supports offset and length parameters that let you read arbitrary byte ranges without fetching the entire DAG. HTTP gateways support regular HTTP range requests (HTTP range requests - HTTP | MDN) and, like ipfs cat, translate them into fetching the minimal subset of a DAG.
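For illustration, a minimal Python sketch of the gateway side of this, using a standard Range header against a public gateway (exampleCID is the placeholder from the question):

```python
# Read only the first KiB of a file through a public gateway via an
# HTTP Range header, instead of downloading the whole thing.
import requests

url = "https://ipfs.io/ipfs/exampleCID/1"  # placeholder CID/path
resp = requests.get(url, headers={"Range": "bytes=0-1023"}, timeout=30)
resp.raise_for_status()

print(resp.status_code)      # 206 Partial Content if the gateway honored the range
print(len(resp.content))     # at most 1024 bytes
```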
If you mean “fetching multiple files from a big directory”, then you can either fetch them in parallel via separate requests, or download the entire directory as a TAR archive (ipfs get -a) or as a CAR archive (ipfs dag export). In the future, IPLD Selectors will let you fetch a subset of a DAG in a more flexible manner.
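A minimal Python sketch of the first option (parallel requests), with a placeholder CID and an arbitrary worker count and index range – tune both for your setup:

```python
# Fetch ipfs://exampleCID/1 .. ipfs://exampleCID/n in parallel through a
# gateway, using a simple thread pool.
import concurrent.futures
import requests

GATEWAY = "https://ipfs.io/ipfs"
CID = "exampleCID"                 # placeholder
START_INDEX, END_INDEX = 1, 100    # placeholders for startIndex..endIndex

def fetch(index: int) -> dict:
    resp = requests.get(f"{GATEWAY}/{CID}/{index}", timeout=30)
    resp.raise_for_status()
    return resp.json()

with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(fetch, range(START_INDEX, END_INDEX + 1)))
```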
You wrote that you want to store metadata on IPFS – you may want to look into ipfs dag put|get (make sure to use the new API from go-ipfs 0.10.0) and experiment with --output-codec=dag-json and --store-codec=dag-cbor for JSON that is stored in binary form (CBOR) at rest.
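As a sketch of what that might look like over a local node's RPC API (assuming Kubo/go-ipfs 0.10.0+ listening on the default port 5001; the HTTP parameter names mirror the CLI flags, so double-check them against your node's version):

```python
# Store a JSON document as dag-cbor, then read it back as JSON, via a
# local node's RPC API.
import json
import requests

API = "http://127.0.0.1:5001/api/v0"
doc = {"name": "example", "index": 1}

# Put: JSON in, CBOR at rest
put = requests.post(
    f"{API}/dag/put",
    params={"store-codec": "dag-cbor", "input-codec": "dag-json"},
    files={"file": ("doc.json", json.dumps(doc))},
)
put.raise_for_status()
cid = put.json()["Cid"]["/"]

# Get: decode back to JSON on the way out
get = requests.post(f"{API}/dag/get", params={"arg": cid, "output-codec": "dag-json"})
print(get.json())
```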
For fetching multiple files from a big directory, which of those methods would you personally recommend? What are the nuances between the two currently available methods you described?
Depends. Do you trust the gateway to return data matching the requested CID? If you run your own gateway, then it is not a concern, and you can use regular HTTP GET for /ipfs/CID/foo for JSON files (or /api/v0/dag/get if you stored the data as dag-cbor) and see if that is enough performance-wise.
If you don't want to trust a remote gateway, then you need to run a local IPFS node: fetch the data as a CAR via /api/v0/dag/export, import it into the node via dag import to verify that the hashes match the requested content, and then use the local node as a “trusted caching layer”.
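A rough sketch of that untrusted-gateway flow, assuming the remote endpoint exposes /api/v0/dag/export and a local node is listening on the default port 5001 (both hosts and the CID are placeholders):

```python
# Pull a CAR from a remote endpoint, then let a local node verify it on
# import: the local node re-hashes every block, so tampered data fails.
import requests

REMOTE = "https://example-gateway.example/api/v0"  # placeholder remote
LOCAL = "http://127.0.0.1:5001/api/v0"             # local node RPC
CID = "exampleCID"                                 # placeholder

# Export the DAG as a CAR from the remote side
car = requests.post(f"{REMOTE}/dag/export", params={"arg": CID}, stream=True)
car.raise_for_status()

# Import it into the local node, which verifies the blocks against their CIDs
imp = requests.post(f"{LOCAL}/dag/import", files={"file": ("dump.car", car.raw)})
imp.raise_for_status()
print(imp.text)  # import stats, including the root CID(s)
```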
For my use case I think it should be fine to trust the ipfs.io gateway to return the right data.
As for fetching multiple files from a big directory… I have a couple of questions:
With regards to fetching them in parallel via separate requests: the directories I plan to download usually contain between 8,000 and 20,000 JSON files. Surely there is some sort of rate limiting in place that makes this infeasible in terms of speed? Is there any way to get around this?
Is there a way to perform something similar to ipfs get -a without setting up a local IPFS client? i.e. can this be done with regular HTTP GET requests? Adding an IPFS client to my backend might add too much complexity (I am trying to keep it a simple API).
Is there any difference in expected performance between fetching them in parallel via separate requests, versus downloading the entire directory in one request?
I forgot to mention, for my use case, I only need to download everything in a directory once, as fast as possible.
Yes. You may hit throttling on gateways and get an HTTP 429 (Too Many Requests) error response.
To get around this:
(A) run your own gateway, or pay someone like Pinata or Infura to run one for you.
(B) make your download logic smarter: back off and retry as soon as you get the first 429 error (see the sketch below).
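A minimal sketch of (B), with exponential back-off and support for a Retry-After header if the gateway sends one:

```python
# Retry a gateway GET on HTTP 429, backing off exponentially between attempts.
import time
import requests

def fetch_with_backoff(url: str, max_retries: int = 5) -> requests.Response:
    delay = 1.0
    for _ in range(max_retries):
        resp = requests.get(url, timeout=30)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp
        # Throttled: honor Retry-After if present, otherwise back off
        retry_after = resp.headers.get("Retry-After")
        time.sleep(float(retry_after) if retry_after else delay)
        delay *= 2
    raise RuntimeError(f"still throttled after {max_retries} attempts: {url}")
```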
Is there a way to perform something similar to ipfs get -a without setting up a local ipfs client?
[…] Is there any difference in expected performance between fetching them in parallel via separate requests, versus downloading the entire directory in one request?
[…] I only need to download everything in a directory once, as fast as possible.
In this case, you will get the most out of content addressing if you fetch the entire thing as a CAR and unpack it locally using something like ipfs-car (which will verify hashes – see the “Fetch and locally verify files from an IPFS gateway over http” example).
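As a rough sketch of the download half, assuming the gateway can return CAR responses (e.g. via a format=car query parameter or an application/vnd.ipld.car Accept header – check what your gateway supports); verification and unpacking then happen out-of-band with ipfs-car:

```python
# Stream an entire directory as a single CAR file from a gateway,
# assuming the gateway supports CAR responses via ?format=car.
import requests

CID = "exampleCID"  # placeholder
url = f"https://ipfs.io/ipfs/{CID}"

with requests.get(url, params={"format": "car"}, stream=True, timeout=120) as resp:
    resp.raise_for_status()
    with open(f"{CID}.car", "wb") as f:
        for chunk in resp.iter_content(chunk_size=1 << 20):
            f.write(chunk)

# Then verify + unpack with the ipfs-car CLI, e.g.:
#   npx ipfs-car unpack exampleCID.car --output ./out
# (exact flags depend on the ipfs-car version)
```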