What are these hourly spikes in disk usage?

I have an IPFS cluster that’s managing around 800 pins, which make up about 2TB of data.

I recently set up some graphs to help with monitoring. With the help of these graphs I realized I have some sort of configuration problem where the VPS either isn’t pinning or isn’t retaining blocks.

That’s another issue I have to solve, but for now I’m just curious about these spikes that appear about once an hour.

The vertical axis shows available bytes.

[Screenshots: two Grafana "Capture Status" dashboard panels showing the hourly spikes in available bytes]

It looks to me like something consumes disk space and then the space is immediately reclaimed. What is happening here?

What’s the size of your repo?
(ipfs repo stat)

How are the logs obtained? Is it Kubo’s Prometheus output? Some solution that looks up df usage?
Can you check the logs? Are the nodes “just” crashing?

root@baba:~# sudo -u ipfs ipfs repo stat --human
NumObjects: 943756
RepoSize:   391 GB
StorageMax: 3.0 TB
RepoPath:   /home/ipfs/.ipfs
Version:    fs-repo@12

Logs are obtained via grafana-agent’s node_exporter integration, using the filesystem collector.
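
For reference, the dashboard is plotting node_exporter’s filesystem gauge. To sanity-check the raw number outside Grafana, something like the following should work (the :12345 port and the /integrations/node_exporter/metrics path are grafana-agent defaults and may differ on your setup):

# pull the available-bytes gauge straight from the agent's node_exporter endpoint
curl -s http://localhost:12345/integrations/node_exporter/metrics | grep '^node_filesystem_avail_bytes'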

I looked at the Kubo logs. It seems stable (no crashing). One thing that sticks out is thousands of ‘context canceled’ errors; I don’t see any other Kubo logs besides these.

Jan 27 20:46:16 baba ipfs[29211]: 2023-01-27T20:46:16.362+0100        ERROR        core/commands/cmdenv        pin/pin.go:133        context canceled
Jan 27 20:46:16 baba ipfs[29211]: 2023-01-27T20:46:16.365+0100        ERROR        core/commands/cmdenv        pin/pin.go:133        context canceled
Jan 27 20:46:16 baba ipfs[29211]: 2023-01-27T20:46:16.365+0100        ERROR        core/commands/cmdenv        pin/pin.go:133        context canceled
Jan 27 20:46:16 baba ipfs[29211]: 2023-01-27T20:46:16.367+0100        ERROR        core/commands/cmdenv        pin/pin.go:133        context canceled
Jan 27 20:46:16 baba ipfs[29211]: 2023-01-27T20:46:16.367+0100        ERROR        core/commands/cmdenv        pin/pin.go:133        context canceled
Jan 27 20:46:16 baba ipfs[29211]: 2023-01-27T20:46:16.368+0100        ERROR        core/commands/cmdenv        pin/pin.go:133        context canceled
Jan 27 20:57:46 baba ipfs[29211]: 2023-01-27T20:57:46.556+0100        ERROR        core/commands/cmdenv        pin/pin.go:133        context canceled
Jan 27 20:58:16 baba ipfs[29211]: 2023-01-27T20:58:16.499+0100        ERROR        core/commands/cmdenv        pin/pin.go:133        context canceled
Jan 27 20:58:16 baba ipfs[29211]: 2023-01-27T20:58:16.501+0100        ERROR        core/commands/cmdenv        pin/pin.go:133        context canceled

I looked at the ipfs-cluster logs. They are more varied than the Kubo logs: many kinds of messages rather than a single repeated error.

Matching timestamps against a spike, this is what I see in the minutes leading up to it.

Jan 27 20:37:44 baba ipfs-cluster-service[56924]: 2023-01-27T20:37:44.783+0100        INFO        restapilog        common/api.go:117        38.242.216.6 - - [27/Jan/2023:20:37:44 +0100] "GET /pins?stream-channels=false HTTP/1.1" 200 728418
Jan 27 20:38:58 baba ipfs-cluster-service[56924]: 2023-01-27T20:38:58.788+0100        INFO        crdt        go-ds-crdt@v0.3.9/crdt.go:562        Number of heads: 1. Current max height: 2303. Queued jobs: 0. Dirty: false
Jan 27 20:39:04 baba ipfs-cluster-service[56924]: 2023-01-27T20:39:04.072+0100        INFO        badger        go-ds-badger@v0.3.0/datastore.go:441        Running GC round
Jan 27 20:39:04 baba ipfs-cluster-service[56924]: 2023-01-27T20:39:04.074+0100        INFO        badger        go-ds-badger@v0.3.0/datastore.go:443        Finished running GC round

Garbage Collection?

Can you explain the difference between the 391GB in the repository and the 2.44 TiB on disk?

It could be a badger GC process; it is not even clear how large the spikes are.

In theory garbage collection should not kick in: StorageMax: 3.0 TB.
pin/pin.go:133 context canceled means the connection to ipfs-cluster has been lost; I guess the pins are timing out.
It’s weird that your repo is flashing like that though.

@hector there isn’t 2.44 TiB on disk, it’s the opposite: 2.44 TiB is available. (Only 493 GB is on disk, which is OS files plus the IPFS repo.)

My hypothesis is that badger GC runs hourly, allocates some space to work with, does its thing, then frees the allocation when it’s done. I haven’t dug into the badger GC code to know if that’s how it behaves.
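
One cheap way to test that hypothesis without reading the badger code: pull the GC-round timestamps out of the journal and see whether they line up with the spikes on the graph. Assuming the systemd unit is called ipfs-cluster-service, as in the logs above:

# list every badger GC round logged by the cluster peer in the last day
journalctl -u ipfs-cluster-service --since "24 hours ago" | grep "Running GC round"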

My worry is that whatever is causing the spikes is interfering with pinning. With all the files I have pinned, 391 GB seems way too low; my estimate is that the repo should be around 2 TB.

I’m expecting the available disk bytes to go down over time, but they’re just hovering at around 2.44 TiB.
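
To get a rough sense of how far behind the repo is, I can compare what the cluster thinks is pinned with what Kubo actually holds locally, along these lines (the grep is just counting status lines, so treat the numbers as approximate):

# pins the cluster reports as fully pinned on this peer
ipfs-cluster-ctl status | grep -c ' PINNED '
# recursive pins Kubo actually has in its repo
ipfs pin ls --type=recursive | wc -l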

@Jorropo I have 8 CIDs in my pinset that aren’t pinned. I think the blocks are no longer available and the pinning attempts are repeatedly failing. I believe ‘context canceled’ might refer to those failing pins.

[redacted CID]:
    > baba                 : PIN_QUEUED | 2023-01-28T03:02:24.740826806+01:00 | Attempts: 1687 | Priority: false
[redacted CID]:
    > baba                 : PIN_QUEUED | 2023-01-28T03:02:24.741466769+01:00 | Attempts: 1687 | Priority: false
[redacted CID]:
    > baba                 : PINNING | 2023-01-28T03:02:24.735521874+01:00 | Attempts: 1688 | Priority: false
[redacted CID]:
    > baba                 : PINNING | 2023-01-28T03:02:24.736383982+01:00 | Attempts: 1688 | Priority: false
[redacted CID]:
    > baba                 : PINNING | 2023-01-28T03:02:24.737921655+01:00 | Attempts: 1688 | Priority: false
[redacted CID]:
    > baba                 : PIN_QUEUED | 2023-01-28T03:02:24.738213641+01:00 | Attempts: 1687 | Priority: false
[redacted CID]:
    > baba                 : PIN_QUEUED | 2023-01-28T03:02:24.73866161+01:00 | Attempts: 1687 | Priority: false
[redacted CID]:
    > baba                 : PINNING | 2023-01-28T03:02:24.739870775+01:00 | Attempts: 1688 | Priority: false

Could it be that disk allocation for pins is occurring, and because they are failing to pin, the disk allocation is reclaimed?

I realized one of my ipfs cluster peers has high load. Next thing I’m going to try is upgrading that peer to see if the situation improves.

We don’t pre-allocate space for blocks; we traverse the DAG and stream blocks to disk as we see them.
If the CIDs were only partially available you would expect the pin call to time out or fail, but the blocks that did arrive would remain in the blockstore (because your StorageMax is 3 TB and the default GC watermark is 90%).
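
You can double-check those values on the node itself; note that GC only runs automatically if the daemon was started with --enable-gc:

# Kubo's GC thresholds
ipfs config Datastore.StorageMax
ipfs config Datastore.StorageGCWatermark
# how often GC runs when --enable-gc is set
ipfs config Datastore.GCPeriod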

I don’t know; if you want to debug, I would try to get a du view while a spike is happening (and one without) to see which files are responsible.
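
For example, a throwaway loop that snapshots du once a minute for an hour or two; diffing a capture taken during a spike against one taken just before should point at whatever directory is growing (the paths are only guesses at the likely suspects, adjust to your layout):

# timestamped size snapshots of the usual suspects, once a minute
while true; do
  du -sm /home/ipfs/.ipfs /home/ipfs/.ipfs-cluster /var/log /tmp > "/root/du-$(date +%H%M).txt" 2>/dev/null
  sleep 60
done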

Any GC process is independent from pinning and does not interfere.

It is still unclear how many bytes those spikes represent, but it could be anything in the system. Probably not even related to Kubo or cluster.

The errors you see in the logs are typical when pins timeout (cluster will cancel the pin request if no progress is seen).
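
If you want to inspect or tune that timeout, it lives in the cluster peer’s service.json under the IPFS connector section. Something like the following, assuming the peer runs as the ipfs user with the default .ipfs-cluster location and that the key is still called pin_timeout in your version:

# how long the cluster waits for a pin to make progress before cancelling it
jq '.ipfs_connector.ipfshttp.pin_timeout' /home/ipfs/.ipfs-cluster/service.json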


That wasn’t my first thought and I appreciate the idea. I went ahead and re-installed Ubuntu and re-provisioned the cluster peer. The peer is syncing and I’m not seeing the spikes anymore.