Documents not loading through the basic gateway

Hi everyone,

We have created a platform that hosts images on our IPFS Cluster, takes one image and some info related to it, compiles them into a PDF, and hosts the PDF on the Cluster too. We then generate a public link to access the data via the classic gateway.

This procedure had worked perfectly since we launched it a month ago. But since last week we have had issues with PDFs loading indefinitely. The odd part is that only about 10% of the PDFs actually load when browsed. I tried a few things, like switching between remote gateways, using the localhost gateway of a locally run node, and even loading from the IPFS Cluster servers themselves, but with no success.

However, since yesterday, when I log into my IPFS-files-management dApp, 90% of the images no longer load. I then restarted the VPSs where our Cluster peers are hosted; they had been running without interruption for over a month and a half, so I thought they might have a pending update or some other problem that a reboot could solve. Surprisingly, the reboot made the images load again, but nothing changed for the PDFs.

This morning I logged back into the dApp and, boom, 50% of the images won’t load again, and of course neither will the PDFs.

At first, I thought that the more complex structure of PDFs in comparison to a basic image could have played a role, then I remembered that the whole platform is hosted on IPFS lol

I had also thought it could be caused by the images / pdfs not having been loaded for a long time and perhaps being dropped from the IPFS network’s buffer, but the issue even happened with freshly uploaded images / pdfs…

If anyone has any idea of how to address this, I am very open to suggestions :pray:

Update: While I was collecting links to show examples of images and PDFs that loaded and ones that didn’t (opening each image and PDF once, which amounts to 180 tries), everything suddenly started loading again, and now it all seems to load perfectly. My wild guess is that every file needs to be opened regularly to stay “alive” on the network? That’s the only thing I can deduce, since I went straight from nothing loading to everything loading perfectly. I should point out that this is the first time loading everything has solved the problem since it appeared a week ago.

Update 2: The fix was short-lived; 10 minutes later the PDFs won’t load again…

If the request just stays hanging, it may have to do with discoverability/reachability of your IPFS peers.

  • Perhaps they are not reachable (make sure they can be dialled-in)
  • Perhaps they are not correctly re-providing content because they are hosting too much content (see Reprovider strategies; you might want to provide only roots).
  • Perhaps they suffered some other problem (a lack of file descriptors sometimes makes a node stop listening for new connections).
  • Perhaps they are having too many connections and the connection manager is killing them too much.

On the other hand, perhaps one or several gateway nodes are having some trouble or are being hammered at a certain moment in time. Note that gateways have an anycast address, so unless you inspect the response headers for the request, you will not know whether it’s always the same gateway machine that is failing/succeeding.

Also, keep in mind that gateways are rate-limited. If your users hit the rate limit for a certain URL, this would give you a very clear error (and not hang), so I don’t think that is the case now, but delivery of content via the gateway should not be taken for granted.
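To tell a hanging request apart from a clear error, and to compare response headers across attempts, a small probe can help. This is only a sketch: the function name `probe_gateway` and the 20-second timeout are illustrative choices, not something from the thread, and which header (if any) identifies the serving machine depends on the gateway.

```python
import time
import urllib.error
import urllib.request


def probe_gateway(url, timeout=20.0, opener=urllib.request.urlopen):
    """Request `url` once and report status, elapsed time and response headers.

    A hanging request surfaces as an error after `timeout` seconds, while an
    HTTP error (for example a rate-limit response) keeps its status code.
    `opener` is injectable so the function can be exercised without a network.
    """
    started = time.monotonic()
    status, headers, error = None, {}, None
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
            headers = dict(response.getheaders())
    except urllib.error.HTTPError as exc:
        # The gateway answered, but with an error status (4xx/5xx).
        status = exc.code
        headers = dict(exc.headers.items())
        error = repr(exc)
    except Exception as exc:
        # Timeouts, DNS failures, refused connections and similar.
        error = repr(exc)
    return {
        "url": url,
        "status": status,
        "seconds": round(time.monotonic() - started, 2),
        "headers": headers,
        "error": error,
    }
```

Calling `probe_gateway` several times on the same gateway link and printing the returned `headers` next to `status` and `seconds` shows whether failures line up with particular responses or are spread evenly.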

Hi Hector,

Thanks for your fast answer, and sorry for the late one on my side; I made sure to test everything you mentioned.

Changing the reprovider mode to “roots” seems to have solved the problem.
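For reference, the same change can be made on the command line with `ipfs config Reprovider.Strategy roots`, followed by a daemon restart. The sketch below shows the equivalent edit on an already-parsed IPFS config. It assumes the config is the usual JSON document with a `Reprovider` object; the helper name and the set of accepted values ("all", "pinned", "roots", as documented for go-ipfs at the time of this thread) are assumptions for illustration.

```python
import json

# Strategy values assumed valid for go-ipfs at the time of the thread.
VALID_STRATEGIES = {"all", "pinned", "roots"}


def set_reprovider_strategy(config, strategy):
    """Return a copy of a parsed IPFS config with Reprovider.Strategy set."""
    if strategy not in VALID_STRATEGIES:
        raise ValueError(f"unknown reprovider strategy: {strategy!r}")
    # Round-trip through JSON to get a deep copy and leave the input untouched.
    updated = json.loads(json.dumps(config))
    updated.setdefault("Reprovider", {})["Strategy"] = strategy
    return updated
```

With "roots", only the root CID of each pinned item is announced, which means far fewer provider records to publish than announcing every block.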

However, the file descriptor part seems a bit more obscure to me. From what I’ve seen, it’s a problem that was mentioned on GitHub, and it seems the max fd was increased to 8192 by default. Is there something else I’m supposed to do on my side?

Lastly, I understand the rate-limiting of the gateway, but then how could we solve this long-term if it becomes an issue, create our own gateway ?

You need to monitor your logs and your system. What are your specs, how many pins are you hosting, what datastore are you using? Lots of details missing to tell.
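As one example of such monitoring, the file descriptor question can be checked directly on a Linux host by comparing a process's open descriptors with its limit. This is a minimal sketch that assumes a Linux `/proc` filesystem and enough permissions to read the daemon's entries; the function names are made up for illustration.

```python
import os


def _to_int(value):
    # /proc reports "unlimited" when no limit is set.
    return None if value == "unlimited" else int(value)


def parse_max_open_files(limits_text):
    """Extract (soft, hard) limits from the text of /proc/<pid>/limits."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            soft, hard = line.split()[3:5]
            return _to_int(soft), _to_int(hard)
    raise ValueError("no 'Max open files' line found")


def fd_report(pid):
    """Count open descriptors of `pid` and read its limits (Linux only)."""
    with open(f"/proc/{pid}/limits") as handle:
        soft, hard = parse_max_open_files(handle.read())
    open_fds = len(os.listdir(f"/proc/{pid}/fd"))
    return {"pid": pid, "open_fds": open_fds, "soft_limit": soft, "hard_limit": hard}
```

Running `fd_report` periodically against the IPFS daemon's PID and logging the result shows whether `open_fds` creeps toward `soft_limit` before the node stops accepting connections.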

Lastly, I understand the rate-limiting of the gateway, but then how could we solve this long-term if it becomes an issue, create our own gateway ?

Yes, or use an alternative gateway, or get your users to run IPFS natively.

Aren’t we doing this if we choose to install a local node in the Brave browser?

Yes, in that case, users would be running IPFS natively, and it should use the local node to copy things as long as they are using ipfs: URLs (or have IPFS Companion installed).

Indeed, I’ve seen that some error messages can appear on the daemon when there are file descriptor issues, but the only one I’ve had on my side so far (on only one of my VPS peers, after restarting the daemon so I could watch for any errors) is this one:

2021-01-22T10:28:55.910Z ERROR p2p-gorpc go-libp2p-gorpc@v0.1.0/call.go:64 protocol not supported

Though I tried uploading a PDF through this peer just after having the error, and it worked without even prompting a WARNING.

The only entry I have in the datastore’s LOG file today across the 3 peers is on the same peer I just mentioned uploading the PDF to:

=============== Jan 22, 2021 (UTC) ===============
07:36:12.133973 table@compaction L0·2 → L1·1 S·792KiB Q·205081
07:36:12.147960 table@build created L1@47 N·1245 S·165KiB "/F5…ZQQ,v202732":"/pr…OME,v201774"
07:36:12.148089 version@stat F·[0 1] S·165KiB[0B 165KiB] Sc·[0.00 0.00]
07:36:12.150714 table@compaction committed F-2 S-626KiB Ke·0 D·10540 T·15.955339ms
07:36:12.154164 table@remove removed @44
07:36:12.154252 table@remove removed @41
07:36:12.154286 table@remove removed @40

Which seems to be a fairly normal report. However, I have a collection of these same lines in the LOG files across the days:

10:36:11.617042 log@legend F·NumFile S·FileSize N·Entry C·BadEntry B·BadBlock Ke·KeyError D·DroppedEntry L·Level Q·SeqNum T·TimeElapsed

As for the number of pins, we currently have 183 files pinned: 82 PDFs and 101 images. The datastore used by the cluster peers is Badger, default config in service.json.

As for the hardware specs, the peer I use for monitoring is an old Asus EEE PC running Lubuntu and connected through WiFi (only 30 mb/s up and 20 mb/s down due to the old WiFi card’s limitations). The storage is the default HDD; I dedicated 200 GB of space to the IPFS peer.
The other two peers are on 2 VPSs hosted by Ionos, and they each have approx. 4 GHz of processing power, 2 GB of RAM, 80 GB of storage on SSDs and 400 mb/s of network bandwidth up and down. The number of open files, inodes and maximum concurrent processes can be customized as needed. (I’ve already set the number of open files to 8192.)

If possible, indeed. Only one thing: in our current state, IPFS content is displayed through OpenSea, which basically copies the image to their lh3 cloud. The only thing users will load through the IPFS gateway is the PDFs, either when opening the External URL on the NFT’s page on OpenSea or by clicking a link on our website. That’s why we were content with using the gateway for now, but I understand that it will be problematic under high traffic.

I am not sure if this is serious (would potentially need context or debug context around that line). Make sure that cluster peers are only talking to other cluster peers. If this is reproducible (i.e. something you see on every restart or when doing something) it would be good to have more info.

That seems like a very low number for the “all” reprovider strategy to cause issues, but then your specs are not great either, so it is hard to say. Just make sure that your nodes are fully reachable so they can be dialled-in for the content (ipfs id should show "/ipfs/kad/1.0.0" among the Protocols).

Indeed that was an unrelated error, no worries on this side.

Are you talking about my monitoring peer’s specs ? Or even the VPSs ? Also, could such a difference in specs between peers cause issues ?

Well, I do not have this exact line under protocol, instead I have "/ipfs/lan/kad/1.0.0"

This means your node is not dial-able from the outside. You need to open your ports (or make sure upnp is supported). Your node is essentially in client-mode and not providing things, so the gateway will only find it if by chance your node opened a connection to it directly.
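The check described here can be automated by reading the `Protocols` list from the JSON that `ipfs id` prints. The sketch below only looks for the two protocol strings quoted in this thread; the function name and the returned labels are illustrative.

```python
import json


def dht_mode(ipfs_id_json):
    """Classify a node from the JSON text printed by `ipfs id`.

    Returns "server" when the public DHT protocol is advertised,
    "lan-only" when only the LAN DHT protocol is present, else "unknown".
    """
    info = json.loads(ipfs_id_json)
    protocols = set(info.get("Protocols") or [])
    if "/ipfs/kad/1.0.0" in protocols:
        return "server"
    if "/ipfs/lan/kad/1.0.0" in protocols:
        return "lan-only"
    return "unknown"
```

A node that reports "lan-only" matches the situation in this thread: it is not advertising itself on the public DHT, so gateways cannot discover it as a provider.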

Oh ok, so to make my local peer reachable by the other cluster peers, I opened ports 9094 and 9096 through NAT forwarding since my local peer has no dedicated network IP unlike the VPSs. I had first deactivated UPnP but when I made it work I reactivated it. It is even currently active for the cluster peer on port 9096.

Should I open other ports maybe ? I only opened IPFS Cluster related ports to achieve interconnectivity between the peers, I must say I passed on opening other IPFS ports.

Edit: I just checked on the VPS peers and they also have the "/ipfs/lan/kad/1.0.0" line showing, and I’ve also only opened ports 9096 and 9094 on the firewall so it 100% comes from ports as you suggested.

You need to open port 4001, and sometimes modify the Announce section of the IPFS config and hardcode the public IP there.
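As a rough illustration, the helpers below build the multiaddr that would go into `Addresses.Announce` and test whether a TCP port accepts connections. The address 203.0.113.7 used in the usage note is a documentation placeholder, not a real peer, and both function names are hypothetical.

```python
import ipaddress
import socket


def announce_addrs(public_ip, port=4001):
    """Build the swarm multiaddr list to hardcode under Addresses.Announce."""
    ip = ipaddress.ip_address(public_ip)  # raises ValueError on bad input
    proto = "ip4" if ip.version == 4 else "ip6"
    return [f"/{proto}/{ip}/tcp/{port}"]


def tcp_port_open(host, port=4001, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `announce_addrs("203.0.113.7")` yields the list to pass to `ipfs config --json Addresses.Announce`. The port check is only meaningful when run from a machine outside the peer's own network, since the point is to test the firewall or NAT forwarding.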

Ok thanks a lot for your help, I will change this and get back to you.

Lastly, is there a way to see the details of a GET HTTP request made to a Cluster Peer ?

Every request that hits the API is logged in the output (restapilog things). There is an option in the config (restapi/http_log_file) that lets you send these to a file.

Ooooohhhh ok thanks a lot for all the help :pray: