Commands hanging indefinitely

Hello everyone,

I’m currently working on setting up an IPFS server for a production application and I could use some assistance with configuring it properly. Here’s the setup and the issue I’m facing:

Setup Details

  • Master Server: AWS m6a.2xlarge machine

    • Specifications:
      • vCPU: 8
      • Memory: 32 GiB
      • Network Performance: Up to 12.5 Gbps
    • Purpose: Handles file uploads
    • Configuration: Running IPFS (Kubo) in a Docker container
  • Cluster Nodes: Two other nodes serving as IPFS gateways on the internet

  • Additional Info: Running IPFS-cluster image on all machines

Problem Description

I’m encountering an issue where IPFS commands start hanging indefinitely, particularly ipfs add and ipfs pin ls. Some commands still work, but overall functionality is severely impaired. Interestingly, resources do not seem to be running out when ConnMgr is enabled, yet the commands still hang; without those settings, all RAM is eventually used up. Another interesting detail: if I restart the Docker container while it is in that state, the applications waiting on a command do receive a response with the CID of the file.
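
Connection count and container memory during one of these episodes can be checked with something like the following (the container name "ipfs" here is a placeholder):

ipfs swarm peers | wc -l          # current number of libp2p connections
docker stats --no-stream ipfs     # container CPU / memory usage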

I’ve tried using Kubo versions 0.27, 0.28, and 0.29, but the issue persists across all versions.

Additional Details

  • Repo Stats:
/ # ipfs repo stat
NumObjects: 2390888
RepoSize:   129242800328
StorageMax: 190000000000
RepoPath:   /data/ipfs
Version:    fs-repo@15
  • Full Configuration:
{
  "API": {
    "HTTPHeaders": {}
  },
  "Addresses": {
    "API": "/ip4/0.0.0.0/tcp/5001",
    "Announce": null,
    "AppendAnnounce": null,
    "Gateway": "/ip4/0.0.0.0/tcp/8080",
    "NoAnnounce": null,
    "Swarm": [
      "/ip4/0.0.0.0/tcp/4001",
      "/ip6/::/tcp/4001",
      "/ip4/0.0.0.0/udp/4001/quic-v1",
      "/ip4/0.0.0.0/udp/4001/quic-v1/webtransport",
      "/ip6/::/udp/4001/quic-v1",
      "/ip6/::/udp/4001/quic-v1/webtransport"
    ]
  },
  "AutoNAT": {},
  "Bootstrap": [
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmcZf59bWwK5XFi76CZX8cbJ4BhTzzA3gU1ZjYZcYW3dwt",
    "/ip4/104.131.131.82/tcp/4001/p2p/QmaCpDMGvV2BGHeYERUEnRQAwe3N8SzbUtfsmvsqQLuvuJ",
    "/ip4/104.131.131.82/udp/4001/quic-v1/p2p/QmaCpDMGvV2BGHeYERUEnRQAwe3N8SzbUtfsmvsqQLuvuJ",
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmNnooDu7bfjPFoTZYxMNLWUQJyrVwtbZg5gBMjTezGAJN",
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmQCU2EcMqAqQPR2i9bChDtGNJchTbq5TbXJJ16u19uLTa",
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmbLHAnMoJPWSCR5Zhtx6BHJX9KiKNN6tpvbUcqanj75Nb",
    "xxx(node 1 address)",
    "xxx(node 2 address)"
  ],
  "DNS": {
    "Resolvers": {}
  },
  "Datastore": {
    "BloomFilterSize": 0,
    "GCPeriod": "1h",
    "HashOnRead": false,
    "Spec": {
      "mounts": [
        {
          "child": {
            "path": "blocks",
            "shardFunc": "/repo/flatfs/shard/v1/next-to-last/2",
            "sync": true,
            "type": "flatfs"
          },
          "mountpoint": "/blocks",
          "prefix": "flatfs.datastore",
          "type": "measure"
        },
        {
          "child": {
            "compression": "none",
            "path": "datastore",
            "type": "levelds"
          },
          "mountpoint": "/",
          "prefix": "leveldb.datastore",
          "type": "measure"
        }
      ],
      "type": "mount"
    },
    "StorageGCWatermark": 90,
    "StorageMax": "190GB"
  },
  "Discovery": {
    "MDNS": {
      "Enabled": true
    }
  },
  "Experimental": {
    "FilestoreEnabled": false,
    "GraphsyncEnabled": false,
    "Libp2pStreamMounting": false,
    "P2pHttpProxy": false,
    "StrategicProviding": false,
    "UrlstoreEnabled": false
  },
  "Gateway": {
    "APICommands": [],
    "HTTPHeaders": {},
    "NoDNSLink": false,
    "NoFetch": false,
    "PathPrefixes": [],
    "PublicGateways": {
      "xxx.com": {
        "Paths": [
          "/ipfs",
          "/ipns"
        ],
        "UseSubdomains": true
      },
      "xxx.io": {
        "Paths": ["/ipfs", "/ipns"],
        "UseSubdomains": true
      }
    },
    "RootRedirect": "",
    "Writable": false
  },
  "Identity": {
    "PeerID": "xxx",
    "PrivKey": "xxx"
  },
  "Internal": {},
  "Ipns": {
    "RecordLifetime": "",
    "RepublishPeriod": "",
    "ResolveCacheSize": 128,
    "UsePubsub": true,
    "MaxCacheTTL": "1m"
  },
  "Migration": {
    "DownloadSources": [],
    "Keep": ""
  },
  "Mounts": {
    "FuseAllowOther": false,
    "IPFS": "/ipfs",
    "IPNS": "/ipns"
  },
  "Peering": {
    "Peers": null
  },
  "Pinning": {
    "RemoteServices": {}
  },
  "Plugins": {
    "Plugins": null
  },
  "Provider": {
    "Strategy": ""
  },
  "Pubsub": {
    "DisableSigning": false,
    "Router": ""
  },
  "Reprovider": {
    "Interval": "11h0m0s",
    "Strategy": "pinned"
  },
  "Routing": {
    "AcceleratedDHTClient": true,
    "Methods": null,
    "Routers": null
  },
  "Swarm": {
    "AddrFilters": null,
    "ConnMgr": {
      "Enabled": true,
      "LowWater": 1500,
      "HighWater": 2500,
      "GracePeriod": "3m"
    },
    "DisableBandwidthMetrics": false,
    "DisableNatPortMap": true,
    "RelayClient": {},
    "RelayService": {},
    "ResourceMgr": {
        "Enabled": true,
        "Limits": {},
        "MaxMemory": "20GiB"
    },
    "Transports": {
      "Multiplexers": {},
      "Network": {},
      "Security": {}
    }
  }
}

Specific Issues

  • Command Hanging: ipfs add and ipfs pin ls commands hang indefinitely
  • Docker Container Unhealthy: Without the ConnMgr settings, the Docker container running IPFS frequently becomes unhealthy and requires a manual restart.

Questions

  1. How can I solve the issue of IPFS hanging on certain commands?
  2. Is it possible to set a timeout for commands? (see the sketch after this list)
  3. Are there any recommended configurations or best practices for running IPFS in a production environment, especially regarding connection and resource management?
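
For reference on question 2: a per-command timeout can be approximated with Kubo’s global --timeout flag on the CLI, or with a client-side timeout such as curl’s --max-time when going through the HTTP API. A minimal sketch (the 60-second value and file name are placeholders):

ipfs --timeout=60s add myfile.bin
curl --max-time 60 -X POST "http://127.0.0.1:5001/api/v0/add?pin=true" -F file=@myfile.bin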

Any advice or suggestions would be greatly appreciated! Thanks in advance for your help.

Best regards,
Tine

Does ipfs pin ls --stream hang too?

Are you able to recursively enumerate all the files in .ipfs/blocks, e.g. with find?

Do ipfs add and ipfs pin ls also hang when running offline (i.e. when the ipfs daemon is not running)?
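
In shell form, those three checks would look roughly like this (the repo path is taken from your repo stat output; testfile.bin is a placeholder):

# 1. streaming pin enumeration while the daemon is running
ipfs pin ls --type=recursive --stream | head

# 2. walk the blockstore directly on disk
find /data/ipfs/blocks -type f | wc -l

# 3. with the daemon stopped, the same commands run offline against the repo
ipfs add --pin=true testfile.bin
ipfs pin ls --type=recursive | head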

To me, it sounds like disk reads might be hanging. In principle, the ConnMgr is not related to this, unless your disk access is hanging because RAM is exhausted by many connections and the OS starts swapping or something.

Now, 32 GiB is a lot of RAM, and even without ConnMgr, the resource manager should be limiting usage. It would be interesting to see what the RAM is being used for when it is about to be exhausted, via a stack dump (kubo/docs/debug-guide.md at master · ipfs/kubo · GitHub).
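
For reference, a goroutine stack dump and a heap profile can be captured from the API port while the node is in that state, roughly like this (assuming the API listens on the default 127.0.0.1:5001, as in your config):

curl -o ipfs.stacks 'http://127.0.0.1:5001/debug/pprof/goroutine?debug=2'
curl -o ipfs.heap 'http://127.0.0.1:5001/debug/pprof/heap'

Recent Kubo versions can also bundle these into a single archive with ipfs diag profile.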

Hi hector, thank you for your reply.

Eventually I discovered that there must be some issue with file pinning. The pin ls --stream command is called (I believe) by the ipfs-cluster image, and our applications were calling add --pin true (actually calling the HTTP API with the equivalent command).
In that case, the “pin ls” and “add” commands hang, which is also visible in ipfs diag cmds. I later changed our apps to call “add” without pinning the file, and since then everything runs smoothly. Also, by pinning the file through the ipfs-cluster service, the CID gets pinned on all my nodes in the cluster.
So as it turns out, it is probably not connected with ConnMgr after all, but it still bugs me, because the issue only appeared recently on the same configuration that used to work. Maybe it has something to do with the repo size, since it has become quite large?
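
For completeness, the flow we ended up with looks roughly like this (default ports assumed; photo.jpg and the returned CID are placeholders):

# add without pinning on the Kubo node (pin defaults to true on /api/v0/add)
curl -X POST "http://127.0.0.1:5001/api/v0/add?pin=false" -F file=@photo.jpg

# then pin the returned CID through the cluster so it is replicated to all nodes
ipfs-cluster-ctl pin add <CID from the add response>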

Perhaps you were GC’ing? I believe it may lock the pinset for other operations while it is in progress. That might explain why it hangs.

I set "GCPeriod": "720h", and I still got commands hanged, so I’m not sure that this is the reason. Is there any way that I can check if the GC is running?

GC doesn’t run unless you start the daemon with --enable-gc. That said, do try running with ipfs --offline daemon and see whether ipfs add --pin true or pin ls still hangs on something.
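
Concretely, something along these lines (the container name "ipfs" and testfile.bin are placeholders):

# was the daemon started with GC enabled at all?
docker inspect ipfs | grep enable-gc

# in one shell: run the daemon without networking
ipfs daemon --offline

# in another shell: retry the operations that hang
ipfs add --pin=true testfile.bin
ipfs pin ls --stream | head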