I have an IPFS cluster that’s managing around 800 pins, which make up about 2TB of data.
I recently set up some graphs to help with monitoring. With their help I realized I have some sort of configuration problem where the VPS either isn’t pinning or isn’t retaining blocks.
That’s another issue I have to solve, but for now I’m just curious about these spikes that appear about once an hour.
The vertical axis shows available bytes.
It looks to me like something consumes disk space and then the disk space is immediately reclaimed. What is happening here?
How are these metrics obtained? Is it Kubo’s Prometheus output, or some solution that looks up df usage?
Can you check the logs? Are the nodes “just” crashing?
The metrics are obtained via grafana-agent’s node_exporter integration, using the filesystem collector.
I looked at the Kubo logs. It seems stable (no crashing). One thing that sticks out is thousands of ‘context canceled’ errors; I don’t see any other Kubo log messages besides these.
Jan 27 20:46:16 baba ipfs[29211]: 2023-01-27T20:46:16.362+0100 ERROR core/commands/cmdenv pin/pin.go:133 context canceled
Jan 27 20:46:16 baba ipfs[29211]: 2023-01-27T20:46:16.365+0100 ERROR core/commands/cmdenv pin/pin.go:133 context canceled
Jan 27 20:46:16 baba ipfs[29211]: 2023-01-27T20:46:16.365+0100 ERROR core/commands/cmdenv pin/pin.go:133 context canceled
Jan 27 20:46:16 baba ipfs[29211]: 2023-01-27T20:46:16.367+0100 ERROR core/commands/cmdenv pin/pin.go:133 context canceled
Jan 27 20:46:16 baba ipfs[29211]: 2023-01-27T20:46:16.367+0100 ERROR core/commands/cmdenv pin/pin.go:133 context canceled
Jan 27 20:46:16 baba ipfs[29211]: 2023-01-27T20:46:16.368+0100 ERROR core/commands/cmdenv pin/pin.go:133 context canceled
Jan 27 20:57:46 baba ipfs[29211]: 2023-01-27T20:57:46.556+0100 ERROR core/commands/cmdenv pin/pin.go:133 context canceled
Jan 27 20:58:16 baba ipfs[29211]: 2023-01-27T20:58:16.499+0100 ERROR core/commands/cmdenv pin/pin.go:133 context canceled
Jan 27 20:58:16 baba ipfs[29211]: 2023-01-27T20:58:16.501+0100 ERROR core/commands/cmdenv pin/pin.go:133 context canceled
I looked at the ipfs-cluster logs. They are more varied than the Kubo logs: many kinds of messages rather than a single repeated error.
Matching up the timestamps, this is what I see in the minutes leading up to a spike.
Jan 27 20:37:44 baba ipfs-cluster-service[56924]: 2023-01-27T20:37:44.783+0100 INFO restapilog common/api.go:117 38.242.216.6 - - [27/Jan/2023:20:37:44 +0100] "GET /pins?stream-channels=false HTTP/1.1" 200 728418
Jan 27 20:38:58 baba ipfs-cluster-service[56924]: 2023-01-27T20:38:58.788+0100 INFO crdt go-ds-crdt@v0.3.9/crdt.go:562 Number of heads: 1. Current max height: 2303. Queued jobs: 0. Dirty: false
Jan 27 20:39:04 baba ipfs-cluster-service[56924]: 2023-01-27T20:39:04.072+0100 INFO badger go-ds-badger@v0.3.0/datastore.go:441 Running GC round
Jan 27 20:39:04 baba ipfs-cluster-service[56924]: 2023-01-27T20:39:04.074+0100 INFO badger go-ds-badger@v0.3.0/datastore.go:443 Finished running GC round
Garbage collection should not kick in, in theory: StorageMax is 3.0 TB. The pin/pin.go:133 ‘context canceled’ errors mean the connection with ipfs-cluster was lost; I guess those requests are timing out.
It’s weird that your repo is flashing like that, though.
@hector there isn’t 2.44 TiB on disk, it’s the opposite: 2.44 TiB is available. (Only 493G is on disk, which is the OS files plus the IPFS repo.)
My hypothesis is that badger GC runs hourly, allocates some space to work with, does its thing, then frees that space when it’s done. I haven’t dug into the badger GC code to confirm whether that’s how it behaves.
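For what it’s worth, here is a minimal sketch of what a periodic Badger value-log GC round looks like, using Badger’s public RunValueLogGC API. This is not ipfs-cluster’s actual GC loop; the path and the hourly interval are assumptions chosen only to mirror the graph. The relevant behaviour is that value-log GC rewrites the live entries of a .vlog file into a new file before deleting the old one, which briefly consumes extra space and then releases it:

```go
package main

import (
	"log"
	"time"

	badger "github.com/dgraph-io/badger"
)

func main() {
	// Path is hypothetical; point it at the datastore you want to inspect.
	db, err := badger.Open(badger.DefaultOptions("/path/to/badger"))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Hourly only to mirror the spikes in the graph; the real interval is
	// whatever the datastore's GC settings say.
	ticker := time.NewTicker(1 * time.Hour)
	defer ticker.Stop()
	for range ticker.C {
		// Keep collecting until Badger reports there is nothing left to rewrite.
		for {
			err := db.RunValueLogGC(0.5) // rewrite vlog files that are at least 50% garbage
			if err == badger.ErrNoRewrite {
				break
			}
			if err != nil {
				log.Printf("badger GC: %v", err)
				break
			}
		}
	}
}
```

If that rewrite-then-delete behaviour is what the cluster’s badger datastore does every hour, it would explain an available-bytes dip that recovers immediately.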
My worry is that whatever is causing the spikes is interfering with pinning. With all the files I have pinned, 391 GB seems way too low; my estimate is that the repo should be around 2 TB.
I’m expecting the available disk bytes to go down over time, but they’re just hovering at around 2.44 TiB.
@Jorropo I have 8 CIDs in my pinset that aren’t pinned. I think those blocks are no longer available and the pin attempts are repeatedly failing; I believe the ‘context canceled’ errors might come from those failed pins.
We don’t pre-allocate space for blocks; we traverse the DAG and stream blocks to disk as we see them.
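Roughly, the shape of it is something like the sketch below. This is purely illustrative, not Kubo’s actual code; the Block, BlockFetcher and BlockStore interfaces are made up for the example. Each block is written as soon as it is fetched, so there is no big up-front reservation that could later be released:

```go
package sketch

import "context"

// Hypothetical interfaces for illustration only.
type Block interface {
	Cid() string
	RawData() []byte
}

type BlockFetcher interface {
	// Fetch returns a block plus the CIDs of its children.
	Fetch(ctx context.Context, cid string) (Block, []string, error)
}

type BlockStore interface {
	Put(ctx context.Context, b Block) error
}

// pin walks the DAG from root, writing each block to the store as it arrives.
func pin(ctx context.Context, root string, f BlockFetcher, bs BlockStore) error {
	queue := []string{root}
	for len(queue) > 0 {
		cid := queue[0]
		queue = queue[1:]
		blk, children, err := f.Fetch(ctx, cid)
		if err != nil {
			return err // this is where a canceled context surfaces
		}
		if err := bs.Put(ctx, blk); err != nil {
			return err
		}
		queue = append(queue, children...)
	}
	return nil
}
```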
If the CIDs were only partially available, you would expect the pin call to time out or fail, but the fetched blocks would remain in the blockstore (because your StorageMax is 3 TB and the default GC watermark is 90%).
I don’t know. If you want to debug, I would try to get a du view while a spike is happening (and one without) to see which files are responsible.
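If it helps, a quick way to take that kind of snapshot is a small helper like the one below (hypothetical, not part of ipfs or ipfs-cluster): it walks a directory and groups on-disk usage by file extension, so running it once during a spike and once after shows which files (.vlog, .sst, block files, …) account for the difference:

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: dusnap <directory>")
		os.Exit(1)
	}
	root := os.Args[1] // e.g. the ipfs repo or the cluster's badger directory
	sizes := map[string]int64{}
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil || d.IsDir() {
			return err
		}
		info, err := d.Info()
		if err != nil {
			return err
		}
		// Group by extension; files without one land under the "" key.
		sizes[filepath.Ext(path)] += info.Size()
		return nil
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for ext, size := range sizes {
		fmt.Printf("%-10s %d bytes\n", ext, size)
	}
}
```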
That wasn’t my first thought and I appreciate the idea. I went ahead and re-installed Ubuntu and re-provisioned the cluster peer. The peer is syncing and I’m not seeing the spikes anymore.