Commands hanging indefinitely

Hello everyone,

I’m currently working on setting up an IPFS server for a production application and I could use some assistance with configuring it properly. Here’s the setup and the issue I’m facing:

Setup Details

  • Master Server: AWS m6a.2xlarge machine

    • Specifications:
      • vCPU: 8
      • Memory: 32 GiB
      • Network Performance: Up to 12.5 Gbps
    • Purpose: Handles file uploads
    • Configuration: Running IPFS (Kubo) in a Docker container
  • Cluster Nodes: Two other nodes serving as IPFS gateways on the internet

  • Additional Info: Running IPFS-cluster image on all machines

Problem Description

I’m encountering an issue where IPFS commands start hanging indefinitely, particularly ipfs add and ipfs pin ls. Some commands still work, but overall functionality is severely impaired. Interestingly, resources do not seem to be running out when ConnMgr is enabled, yet the commands still hang; without those settings, all RAM is eventually used up. Another interesting detail: if I restart the Docker container while it is in that state, the applications waiting on a command do receive a response with the CID of the file.
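
Connection count and container memory during one of these episodes can be checked with something like the following (the container name "ipfs" here is a placeholder):

ipfs swarm peers | wc -l          # current number of libp2p connections
docker stats --no-stream ipfs     # container CPU / memory usage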

I’ve tried using Kubo versions 0.27, 0.28, and 0.29, but the issue persists across all versions.

Additional Details

  • Repo Stats:
/ # ipfs repo stat
NumObjects: 2390888
RepoSize:   129242800328
StorageMax: 190000000000
RepoPath:   /data/ipfs
Version:    fs-repo@15
  • Full Configuration:
{
  "API": {
    "HTTPHeaders": {}
  },
  "Addresses": {
    "API": "/ip4/0.0.0.0/tcp/5001",
    "Announce": null,
    "AppendAnnounce": null,
    "Gateway": "/ip4/0.0.0.0/tcp/8080",
    "NoAnnounce": null,
    "Swarm": [
      "/ip4/0.0.0.0/tcp/4001",
      "/ip6/::/tcp/4001",
      "/ip4/0.0.0.0/udp/4001/quic-v1",
      "/ip4/0.0.0.0/udp/4001/quic-v1/webtransport",
      "/ip6/::/udp/4001/quic-v1",
      "/ip6/::/udp/4001/quic-v1/webtransport"
    ]
  },
  "AutoNAT": {},
  "Bootstrap": [
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmcZf59bWwK5XFi76CZX8cbJ4BhTzzA3gU1ZjYZcYW3dwt",
    "/ip4/104.131.131.82/tcp/4001/p2p/QmaCpDMGvV2BGHeYERUEnRQAwe3N8SzbUtfsmvsqQLuvuJ",
    "/ip4/104.131.131.82/udp/4001/quic-v1/p2p/QmaCpDMGvV2BGHeYERUEnRQAwe3N8SzbUtfsmvsqQLuvuJ",
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmNnooDu7bfjPFoTZYxMNLWUQJyrVwtbZg5gBMjTezGAJN",
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmQCU2EcMqAqQPR2i9bChDtGNJchTbq5TbXJJ16u19uLTa",
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmbLHAnMoJPWSCR5Zhtx6BHJX9KiKNN6tpvbUcqanj75Nb",
    "xxx(node 1 address)",
    "xxx(node 2 address)"
  ],
  "DNS": {
    "Resolvers": {}
  },
  "Datastore": {
    "BloomFilterSize": 0,
    "GCPeriod": "1h",
    "HashOnRead": false,
    "Spec": {
      "mounts": [
        {
          "child": {
            "path": "blocks",
            "shardFunc": "/repo/flatfs/shard/v1/next-to-last/2",
            "sync": true,
            "type": "flatfs"
          },
          "mountpoint": "/blocks",
          "prefix": "flatfs.datastore",
          "type": "measure"
        },
        {
          "child": {
            "compression": "none",
            "path": "datastore",
            "type": "levelds"
          },
          "mountpoint": "/",
          "prefix": "leveldb.datastore",
          "type": "measure"
        }
      ],
      "type": "mount"
    },
    "StorageGCWatermark": 90,
    "StorageMax": "190GB"
  },
  "Discovery": {
    "MDNS": {
      "Enabled": true
    }
  },
  "Experimental": {
    "FilestoreEnabled": false,
    "GraphsyncEnabled": false,
    "Libp2pStreamMounting": false,
    "P2pHttpProxy": false,
    "StrategicProviding": false,
    "UrlstoreEnabled": false
  },
  "Gateway": {
    "APICommands": [],
    "HTTPHeaders": {},
    "NoDNSLink": false,
    "NoFetch": false,
    "PathPrefixes": [],
    "PublicGateways": {
      "xxx.com": {
        "Paths": [
          "/ipfs",
          "/ipns"
        ],
        "UseSubdomains": true
      },
      "xxx.io": {
        "Paths": ["/ipfs", "/ipns"],
        "UseSubdomains": true
      }
    },
    "RootRedirect": "",
    "Writable": false
  },
  "Identity": {
    "PeerID": "xxx",
    "PrivKey": "xxx"
  },
  "Internal": {},
  "Ipns": {
    "RecordLifetime": "",
    "RepublishPeriod": "",
    "ResolveCacheSize": 128,
    "UsePubsub": true,
    "MaxCacheTTL": "1m"
  },
  "Migration": {
    "DownloadSources": [],
    "Keep": ""
  },
  "Mounts": {
    "FuseAllowOther": false,
    "IPFS": "/ipfs",
    "IPNS": "/ipns"
  },
  "Peering": {
    "Peers": null
  },
  "Pinning": {
    "RemoteServices": {}
  },
  "Plugins": {
    "Plugins": null
  },
  "Provider": {
    "Strategy": ""
  },
  "Pubsub": {
    "DisableSigning": false,
    "Router": ""
  },
  "Reprovider": {
    "Interval": "11h0m0s",
    "Strategy": "pinned"
  },
  "Routing": {
    "AcceleratedDHTClient": true,
    "Methods": null,
    "Routers": null
  },
  "Swarm": {
    "AddrFilters": null,
    "ConnMgr": {
      "Enabled": true,
      "LowWater": 1500,
      "HighWater": 2500,
      "GracePeriod": "3m"
    },
    "DisableBandwidthMetrics": false,
    "DisableNatPortMap": true,
    "RelayClient": {},
    "RelayService": {},
    "ResourceMgr": {
        "Enabled": true,
        "Limits": {},
        "MaxMemory": "20GiB"
    },
    "Transports": {
      "Multiplexers": {},
      "Network": {},
      "Security": {}
    }
  }
}

Specific Issues

  • Command Hanging: ipfs add and ipfs pin ls commands hang indefinitely
  • Docker Container Unhealthy: Without the ConnMgr settings, the Docker container running IPFS frequently becomes unhealthy and requires a manual restart.

Questions

  1. How can I solve the issue of IPFS hanging on certain commands?
  2. Is it possible to set a timeout for commands? (see the sketch after this list)
  3. Are there any recommended configurations or best practices for running IPFS in a production environment, especially regarding connection and resource management?
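
For reference on question 2: a per-command timeout can be approximated with Kubo’s global --timeout flag on the CLI, or with a client-side timeout such as curl’s --max-time when going through the HTTP API. A minimal sketch (the 60-second value and file name are placeholders):

ipfs --timeout=60s add myfile.bin
curl --max-time 60 -X POST "http://127.0.0.1:5001/api/v0/add?pin=true" -F file=@myfile.bin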

Any advice or suggestions would be greatly appreciated! Thanks in advance for your help.

Best regards,
Tine

Does ipfs pin ls --stream hang too?

Are you able to recursively enumerate all the files in .ipfs/blocks, e.g. with find?

Do ipfs add and ipfs pin ls also hang when running offline (i.e. when the ipfs daemon is not running)?
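
In shell form, those three checks would look roughly like this (the repo path is taken from your repo stat output; testfile.bin is a placeholder):

# 1. streaming pin enumeration while the daemon is running
ipfs pin ls --type=recursive --stream | head

# 2. walk the blockstore directly on disk
find /data/ipfs/blocks -type f | wc -l

# 3. with the daemon stopped, the same commands run offline against the repo
ipfs add --pin=true testfile.bin
ipfs pin ls --type=recursive | head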

To me, it sounds like disk reads might be hanging. In principle, the ConnMgr is not related to this, unless your disk access is hanging because RAM is exhausted by many connections and the OS starts swapping or something.

Now, 32 GiB is a lot of RAM, and even without ConnMgr, the resource manager should be limiting usage. It would be interesting to see what the RAM is being used for when it is about to be exhausted, via a stack dump (kubo/docs/debug-guide.md at master · ipfs/kubo · GitHub).
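
For reference, a goroutine stack dump and a heap profile can be captured from the API port while the node is in that state, roughly like this (assuming the API listens on the default 127.0.0.1:5001, as in your config):

curl -o ipfs.stacks 'http://127.0.0.1:5001/debug/pprof/goroutine?debug=2'
curl -o ipfs.heap 'http://127.0.0.1:5001/debug/pprof/heap'

Recent Kubo versions can also bundle these into a single archive with ipfs diag profile.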

Hi hector, thank you for your reply.

Eventually I discovered that there must be some issue with file pinning. The pin ls --stream command is called (I believe) by the ipfs-cluster image, and our applications were calling add --pin true (actually calling the HTTP API with the equivalent command).
In that case, the “pin ls” and “add” commands hang, which is also visible in ipfs diag cmds. I later changed our apps to call “add” without pinning the file, and since then everything runs smoothly. Also, by pinning the file through the ipfs-cluster service, the CID gets pinned on all my nodes in the cluster.
So as it turns out, it is probably not connected with ConnMgr after all, but it still bugs me, because the issue only appeared recently on the same configuration that used to work. Maybe it has something to do with the repo size, since it has become quite large?
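
For completeness, the flow we ended up with looks roughly like this (default ports assumed; photo.jpg and the returned CID are placeholders):

# add without pinning on the Kubo node (pin defaults to true on /api/v0/add)
curl -X POST "http://127.0.0.1:5001/api/v0/add?pin=false" -F file=@photo.jpg

# then pin the returned CID through the cluster so it is replicated to all nodes
ipfs-cluster-ctl pin add <CID from the add response>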

Perhaps you were GC’ing? I believe it may lock the pinset for other operations while it is in progress. That might explain why it hangs.

I set "GCPeriod": "720h", and I still got commands hanged, so I’m not sure that this is the reason. Is there any way that I can check if the GC is running?

GC doesn’t run unless you start the daemon with --enable-gc. That said, do try running with ipfs --offline daemon and see whether ipfs add --pin true or pin ls still hangs on something.
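
Concretely, something along these lines (the container name "ipfs" and testfile.bin are placeholders):

# was the daemon started with GC enabled at all?
docker inspect ipfs | grep enable-gc

# in one shell: run the daemon without networking
ipfs daemon --offline

# in another shell: retry the operations that hang
ipfs add --pin=true testfile.bin
ipfs pin ls --stream | head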