Memory leaks in 0.24

Since I upgraded from 0.20 to 0.24, the memory leak seems to have gotten dramatically worse. We had it before, but the server would hold up for at least a few months. Since the upgrade about a week ago (no other config changes), ipfs fills up all the memory in 3-4 days.

I haven’t posted an issue on GitHub yet, but I want to ask whether anyone else is seeing similar behaviour, and whether there are any server options that might help eliminate the issue.

Here is the memory usage chart (we upgraded to 0.24 around the beginning of December)

I’m not aware of anything. Please go ahead and open an issue on GitHub, and post an ipfs diag profile captured when memory usage is at its highest.

After a restart a few hours ago, I’ve added a chart for the ipfs daemon specifically. Memory usage keeps climbing at a decent pace (the number is a percentage; the machine has 8 GB of RAM)

The diag profile is 40.7 MB, is this normal? OK, I see the whole ipfs binary is included in there, so that’s 90 MB unzipped.

I’ll see if I can catch it at maximum memory usage while I can still ssh into the box to get the diag info out.

Yeah, we need the binary because the profile contains stack traces recorded as instruction pointers.
By following the instruction pointers into the ELF we can get the line-number debug info and figure out the function names.

For official binaries from dist.ipfs.tech we could omit it TBH; that would need a script which uses the version file to pull the same version from dist.ipfs.tech.

thx

Yep, I updated from there. Perhaps when the diag is created it could run a checksum over the binary and, if it matches a known one, skip including it.

I’ll leave it running for another day and get the diag snapshot

OK, but … how do you include the checksum of the binary inside the binary without also changing the checksum of the binary?

I guess we could skip the embedded checksum when checksumming the binary, but then it’s not as easy as just taking the input file and hashing it: we’d need to do symbol resolution to know which couple of bytes we must not hash.
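To make the "don’t hash a couple of bytes" idea concrete, here is a rough shell sketch. It assumes the offset and length of the embedded checksum are already known; finding them is the symbol-resolution step mentioned above and is not shown. hash_skipping is a made-up helper name (nothing like it exists in kubo), and sha256sum from GNU coreutils is assumed to be installed.

```shell
# hash_skipping FILE OFFSET LENGTH
# Prints the SHA-256 of FILE with LENGTH bytes, starting at byte OFFSET
# (0-based), left out of the hashed stream.
# OFFSET and LENGTH are hypothetical inputs that would come from
# resolving the symbol that holds the embedded checksum.
hash_skipping() {
  file=$1
  offset=$2
  length=$3
  {
    # Everything before the skipped region.
    head -c "$offset" "$file"
    # Everything after it; tail -c +N starts at byte N, counted from 1.
    tail -c "+$((offset + length + 1))" "$file"
  } | sha256sum | cut -d' ' -f1
}

# Example (hypothetical offset and length):
#   hash_skipping ./ipfs 4096 32
```

Two builds that differ only inside the skipped range produce the same digest, which is the property the embedded checksum would rely on.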

I was thinking of a special -tags ipfs/distribution build which blindly assumes this is OK to do.

The charts are looking creepy today

I cannot attach files here so I pinned the report

added QmWAfBzgpyUywTwP28MuRA78GCC12heQGLT58UAn9N9YBg ipfs-profile-2023-12-11T03_46_33Z_00.zip

So this should work

ipfs cat QmWAfBzgpyUywTwP28MuRA78GCC12heQGLT58UAn9N9YBg > ipfs-profile-2023-12-11T03_46_33Z_00.zip

Interesting, to me it looks like you have many thousands of connections open.
What does ipfs swarm peers | wc -l report?

ipfs swarm peers | wc -l
459

at this moment (I had to restart the ipfs service as it was too close to eating all the memory)

Can you capture another profile, along with the number of peers, when it’s using a lot of memory again please?
The profile shows that quic is holding on to lots of connection objects. I would like to know whether that is because we have lots of connections open or because we don’t properly clean up dead connections.

Sure, I will add a metric to the chart to track peers, and will see how those charts are correlated.
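For anyone wanting to collect the same two numbers without a monitoring system, a minimal sketch follows. It assumes a Linux box with procps ps and a daemon process named ipfs; the function name sample and the output format are made up for illustration.

```shell
# Print one line: UTC timestamp, peer count, daemon RSS in KiB.
sample() {
  # Same peer count command as used earlier in the thread.
  peers=$(ipfs swarm peers | wc -l | tr -d ' ')
  # Sum the resident set size of every process named "ipfs"
  # (assumes procps ps; rss= suppresses the header line).
  rss=$(ps -o rss= -C ipfs | awk '{s += $1} END {print s + 0}')
  printf '%s %s %s\n' "$(date -u +%FT%TZ)" "$peers" "$rss"
}

# Example: append a sample every minute.
#   while sleep 60; do sample >> ipfs-metrics.log; done
```

Plotting the second and third columns against the first shows whether memory follows the peer count or climbs independently of it.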

Considering that our usage patterns have not changed at all over the past many months, and the only change a week ago was the upgrade to 0.24, that leaves me with two realistic options:

  1. some code issue with 0.24 upgrade
  2. some old config options that we haven’t changed that lead to this problem

That’s interesting. I checked the peers recently and the count was around 4.5-4.7k.
I ran ipfs swarm peers a few times and the number has now gone down to 1.5k (as has memory!) without any ipfs restarts.

Now we have charts to monitor memory usage and peer count, so I can get info over time as more data accumulates.

(Would be interesting if the peers get cleaned up by calling ipfs swarm peers somehow…)

Can you post ipfs config show too please? (it excludes private keys)

With some of the values/ips/peer ids cleaned up:

ipfs config show
{
  "API": {
    "HTTPHeaders": {
      "Access-Control-Allow-Credentials": [
        "true"
      ],
      "Access-Control-Allow-Headers": [
        "Authorization,Accept,Origin,DNT,X-CustomHeader,Keep-Alive,User-Agent,X-Requested-With,If-Modified-Since,Cache-Control,Content-Type,Content-Range,Range"
      ],
      "Access-Control-Allow-Methods": [
        "GET,POST,OPTIONS,PUT,DELETE,PATCH"
      ],
      "Access-Control-Allow-Origin": [
        "*"
      ]
    }
  },
  "Addresses": {
    "API": "/ip4/127.0.0.1/tcp/5002",
    "Announce": null,
    "AppendAnnounce": null,
    "Gateway": "/ip4/127.0.0.1/tcp/8080",
    "NoAnnounce": [
      "CLEANED UP"
    ],
    "Swarm": [
      "/ip4/0.0.0.0/tcp/4001",
      "/ip6/::/tcp/4001",
      "/ip4/0.0.0.0/udp/4001/quic-v1",
      "/ip4/0.0.0.0/udp/4001/quic-v1/webtransport",
      "/ip6/::/udp/4001/quic-v1",
      "/ip6/::/udp/4001/quic-v1/webtransport"
    ]
  },
  "AutoNAT": {},
  "Bootstrap": [
    "/dnsaddr/bootstrap.libp2p.io/p2p/Qm******",
    "/dnsaddr/bootstrap.libp2p.io/p2p/Qm******",
    "/dnsaddr/bootstrap.libp2p.io/p2p/Qm******",
    "/dnsaddr/bootstrap.libp2p.io/p2p/Qm******",
    "/ip4/x.x.x.x/tcp/4001/p2p/Qm******",
    "/ip4/x.x.x.x/udp/4001/quic-v1/p2p/Qm******"
  ],
  "DNS": {
    "Resolvers": {}
  },
  "Datastore": {
    "BloomFilterSize": 0,
    "GCPeriod": "1h",
    "HashOnRead": false,
    "Spec": {
      "mounts": [
        {
          "child": {
            "path": "blocks",
            "shardFunc": "/repo/flatfs/shard/v1/next-to-last/2",
            "sync": true,
            "type": "flatfs"
          },
          "mountpoint": "/blocks",
          "prefix": "flatfs.datastore",
          "type": "measure"
        },
        {
          "child": {
            "compression": "none",
            "path": "datastore",
            "type": "levelds"
          },
          "mountpoint": "/",
          "prefix": "leveldb.datastore",
          "type": "measure"
        }
      ],
      "type": "mount"
    },
    "StorageGCWatermark": 90,
    "StorageMax": "10GB"
  },
  "Discovery": {
    "MDNS": {
      "Enabled": false
    }
  },
  "Experimental": {
    "FilestoreEnabled": false,
    "GraphsyncEnabled": false,
    "Libp2pStreamMounting": false,
    "P2pHttpProxy": false,
    "StrategicProviding": false,
    "UrlstoreEnabled": false
  },
  "Gateway": {
    "APICommands": [],
    "HTTPHeaders": {
      "Access-Control-Allow-Credentials": [
        "true"
      ],
      "Access-Control-Allow-Headers": [
        "Authorization,Accept,Origin,DNT,X-CustomHeader,Keep-Alive,User-Agent,X-Requested-With,If-Modified-Since,Cache-Control,Content-Type,Content-Range,Range"
      ],
      "Access-Control-Allow-Methods": [
        "GET,POST,OPTIONS,PUT,DELETE,PATCH"
      ]
    },
    "NoDNSLink": false,
    "NoFetch": true,
    "PathPrefixes": [],
    "PublicGateways": null,
    "RootRedirect": "",
    "Writable": false
  },
  "Identity": {
    "PeerID": "CLEANED UP"
  },
  "Internal": {},
  "Ipns": {
    "RecordLifetime": "",
    "RepublishPeriod": "",
    "ResolveCacheSize": 128,
    "UsePubsub": true
  },
  "Migration": {
    "DownloadSources": [],
    "Keep": ""
  },
  "Mounts": {
    "FuseAllowOther": false,
    "IPFS": "/ipfs",
    "IPNS": "/ipns"
  },
  "Peering": {
    "Peers": null
  },
  "Pinning": {
    "RemoteServices": {}
  },
  "Plugins": {
    "Plugins": null
  },
  "Provider": {
    "Strategy": ""
  },
  "Pubsub": {
    "DisableSigning": false,
    "Router": ""
  },
  "Reprovider": {},
  "Routing": {
    "AcceleratedDHTClient": true,
    "Methods": null,
    "Routers": null
  },
  "Swarm": {
    "AddrFilters": [
      "CLEANED UP"
    ],
    "ConnMgr": {},
    "DisableBandwidthMetrics": false,
    "DisableNatPortMap": true,
    "RelayClient": {},
    "RelayService": {},
    "ResourceMgr": {},
    "Transports": {
      "Multiplexers": {},
      "Network": {},
      "Security": {}
    }
  }
}

About 12h of charts now. It looks like just constantly running ipfs swarm peers (as Zabbix does to read the values) keeps the number of peers from hanging at the high mark; it gets cleaned up.

This is the accelerated DHT client’s hourly crawl.
Similar to ipfs/kubo issue #9990 on GitHub: "Accelerated DHT Client causes OOM kill upon start of IPFS, ResourceMgr.MaxMemory ignored".
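If the hourly crawl is the suspect, one way to check is to switch the accelerated client off for a while and compare the charts. A sketch, assuming the slower DHT operations that come with the standard client are acceptable during the test. Routing.AcceleratedDHTClient is the key shown in the config above; the wrapper name set_accelerated_dht is made up.

```shell
# Set Routing.AcceleratedDHTClient to true or false, then print the
# stored value. The daemon has to be restarted for the change to apply.
set_accelerated_dht() {
  ipfs config --json Routing.AcceleratedDHTClient "$1"
  ipfs config Routing.AcceleratedDHTClient
}

# Example:
#   set_accelerated_dht false   # then restart the ipfs daemon
```

Setting it back to true and restarting restores the previous behaviour.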

Thanks.
Today at some point memory just jumped up and stayed there, with no visible difference in peer count though.
The next jump like that will brick the container.

I’ve run diag again

added QmYpzYJ7PkS2aWre2qCnj6kUSfkE54vcDAqydBERY4ySEP ipfs-profile-2023-12-12T11_29_14Z_00.zip

download:

 ipfs cat QmYpzYJ7PkS2aWre2qCnj6kUSfkE54vcDAqydBERY4ySEP > ipfs-profile-2023-12-12T11_29_14Z_00.zip

I haven’t found a better solution than just restarting the ipfs daemon from time to time.
The longest it lasted was about 6-7 days, until the EC2 instance became unresponsive and needed a reboot.