IPFS-Cluster & IPFS daemons keep getting killed

I have a 6-node IPFS private network, and I’m also running the nodes as a cluster.
The IPFS version I use is 0.4.17 and the cluster-service version is 0.4.0,
and I’m running these on t2.small (2 GB RAM) AWS Ubuntu instances.

The issue is that both the ipfs daemon and the ipfs-cluster daemon keep getting killed frequently.

Can you help me identify the reason for these frequent failures?

Here are the logs of the IPFS daemon:

07:23:27.615 DEBUG mdns: starting mdns query mdns.go:129
07:23:27.615 DEBUG mdns: Handling MDNS entry: 10.10.5.33:4001 Qmc5KzNtecwyGFW4vXs2tWax8eFeoWJLDGPJyyqRuwrJ86 mdns.go:155
07:23:27.615 DEBUG mdns: got our own mdns entry, skipping mdns.go:163
07:23:32.615 DEBUG mdns: mdns query complete mdns.go:142
07:23:34.756 DEBUG cmds/http: incoming API request: /repo/stat handler.go:88

07:23:36.415 DEBUG mdns: mdns service halting mdns.go:148
07:23:36.415 DEBUG dht: Error unmarshaling data: context canceled dht_net.go:43
07:23:36.416 DEBUG dht: Error unmarshaling data: context canceled dht_net.go:43
07:23:36.416 DEBUG dht: Error unmarshaling data: context canceled dht_net.go:43
07:23:36.416 DEBUG dht: Error unmarshaling data: context canceled dht_net.go:43
Received interrupt signal, shutting down…
(Hit ctrl-c again to force-shutdown the daemon.)
07:23:36.417 DEBUG bitswap: bitswap task worker shutting down… workers.go:82
07:23:36.417 DEBUG bitswap: bitswap task worker shutting down… workers.go:82
07:23:36.417 DEBUG bitswap: bitswap task worker shutting down… workers.go:82
07:23:36.417 DEBUG bitswap: bitswap task worker shutting down… workers.go:82
07:23:36.417 DEBUG bitswap: bitswap task worker shutting down… workers.go:82
07:23:36.417 DEBUG bitswap: bitswap task worker shutting down… workers.go:85
07:23:36.417 DEBUG bitswap: bitswap task worker shutting down… workers.go:85
07:23:36.417 DEBUG bitswap: bitswap task worker shutting down… workers.go:82
07:23:36.417 WARNI swarm2: swarm listener accept error: process closing swarm_listen.go:77
07:23:36.417 WARNI swarm2: swarm listener accept error: accept tcp 0.0.0.0:4001: use of closed network connection swarm_listen.go:77
07:23:36.417 WARNI swarm2: swarm listener accept error: accept tcp [::]:4001: use of closed network connection swarm_listen.go:77
07:23:36.421 INFO core/serve: server at /ip4/127.0.0.1/tcp/8080 terminating… corehttp.go:99
07:23:36.421 INFO core/serve: server at /ip4/127.0.0.1/tcp/5001 terminating… corehttp.go:99
07:23:36.422 INFO core/serve: server at /ip4/127.0.0.1/tcp/8080 terminated corehttp.go:117
07:23:36.422 DEBUG bitswap: provideKeys channel closed workers.go:123
07:23:36.422 INFO core/serve: server at /ip4/127.0.0.1/tcp/5001 terminated corehttp.go:117
07:23:36.422 DEBUG bitswap_ne: bitswap net handleNewStream from <peer.ID QzCrDx> error: connection reset ipfs_impl.go:197
07:23:36.422 INFO bitswap: Bitswap ReceiveError: connection reset bitswap.go:434
07:23:36.422 DEBUG bitswap_ne: bitswap net handleNewStream from <peer.ID XrGa47> error: connection reset ipfs_impl.go:197
07:23:36.422 INFO bitswap: Bitswap ReceiveError: connection reset bitswap.go:434
07:23:36.422 DEBUG bitswap_ne: bitswap net handleNewStream from <peer.ID X2Ryhe> error: connection reset ipfs_impl.go:197
07:23:36.422 INFO bitswap: Bitswap ReceiveError: connection reset bitswap.go:434
07:23:36.422 DEBUG core: core is shutting down… core.go:598
07:23:36.422 DEBUG blockservi: blockservice is shutting down… blockservice.go:316
07:23:36.424 INFO cmd/ipfs: Gracefully shut down daemon daemon.go:352

Here are the IPFS Cluster daemon logs:

07:23:35.812 DEBUG basichost: protocol negotiation took 46.273µs basic_host.go:175
07:23:35.812 DEBUG libp2p-raf: QmXcjNQimbqz64xz76r4TfBAs8Fvj45GxcUuHXVTmFxUWg accepted connection from: QmctiUyZ5fpZf1GwqrrUVJ1Mrc4ArNordKLJzbSi44ZK8Q log.go:172
07:23:35.987 DEBUG basichost: protocol negotiation took 22.92µs basic_host.go:175
07:23:35.987 DEBUG libp2p-raf: QmXcjNQimbqz64xz76r4TfBAs8Fvj45GxcUuHXVTmFxUWg accepted connection from: QmctiUyZ5fpZf1GwqrrUVJ1Mrc4ArNordKLJzbSi44ZK8Q log.go:172
07:23:36.169 DEBUG basichost: protocol negotiation took 22.847µs basic_host.go:175
07:23:36.169 DEBUG libp2p-raf: QmXcjNQimbqz64xz76r4TfBAs8Fvj45GxcUuHXVTmFxUWg accepted connection from: QmctiUyZ5fpZf1GwqrrUVJ1Mrc4ArNordKLJzbSi44ZK8Q log.go:172
07:23:36.330 DEBUG basichost: protocol negotiation took 41.543µs basic_host.go:175
07:23:36.330 DEBUG libp2p-raf: QmXcjNQimbqz64xz76r4TfBAs8Fvj45GxcUuHXVTmFxUWg accepted connection from: QmctiUyZ5fpZf1GwqrrUVJ1Mrc4ArNordKLJzbSi44ZK8Q log.go:172
07:23:36.416 INFO cluster: shutting down Cluster daemon.go:185
07:23:36.416 INFO consensus: stopping Consensus component cluster.go:440
07:23:36.416 DEBUG consensus: Raft state is catching up to the latest known version. Please wait… raft.go:411
07:23:36.416 DEBUG consensus: current Raft index: 108/108 raft.go:411
07:23:36.416 DEBUG mapstate: Marshal-- Marshalling state of version 4 codec.go:36
07:23:36.419 INFO raft: NOTICE: Some RAFT log messages repeat and will only be logged once logging.go:71
07:23:36.419 INFO raft: Starting snapshot up to 108 logging.go:52
07:23:36.419 INFO raft: [INFO] snapshot: Creating new snapshot at /home/ubuntu/.ipfs-cluster/raft/snapshots/47967-108-1534317816419.tmp logging.go:52
07:23:36.419 WARNI libp2p-raf: [WARN] Unable to get address for server id QmXcjNQimbqz64xz76r4TfBAs8Fvj45GxcUuHXVTmFxUWg, using fallback address QmXcjNQimbqz64xz76r4TfBAs8Fvj45GxcUuHXVTmFxUWg: libp2p host does not know peer QmXcjNQimbqz64xz76r4TfBAs8Fvj45GxcUuHXVTmFxUWg log.go:172
07:23:36.427 INFO raft: Snapshot to 108 complete logging.go:52
07:23:36.427 INFO monitor: stopping Monitor cluster.go:461
07:23:36.427 INFO restapi: stopping Cluster API cluster.go:466
07:23:36.427 INFO ipfshttp: stopping IPFS Proxy cluster.go:470
07:23:36.427 INFO pintracker: stopping MapPinTracker cluster.go:475
07:23:36.427 DEBUG conn: listener closing: <peer.ID XcjNQi> /ip4/0.0.0.0/tcp/9096 listener.go:146
07:23:36.427 DEBUG conn: listener ignoring conn with temporary err: accept tcp 0.0.0.0:9096: use of closed network connection temp_err_catcher.go:74
07:23:36.427 DEBUG conn: listener closing: <peer.ID XcjNQi> /ip4/0.0.0.0/tcp/9096 listener.go:134
07:23:36.427 WARNI swarm2: swarm listener accept error: peerstream listener failed: listener is closed asm_amd64.s:2361
07:23:36.428 INFO pubsub: pubsub processloop shutting down asm_amd64.s:2361
07:23:36.428 DEBUG conn: listener closed: <peer.ID XcjNQi> /ip4/0.0.0.0/tcp/9096 listen.go:261
07:23:36.428 INFO pubsub: error reading rpc from <peer.ID NdcjFP>: connection reset pubsub.go:154
07:23:36.428 ERROR libp2p-raf: Failed to decode incoming command: connection reset log.go:172
07:23:36.428 INFO pubsub: error reading rpc from <peer.ID fT3SAJ>: connection reset pubsub.go:154
07:23:36.428 INFO pubsub: error reading rpc from <peer.ID ZFk7GT>: connection reset pubsub.go:154
07:23:36.428 INFO pubsub: error reading rpc from <peer.ID ctiUyZ>: connection reset pubsub.go:154
07:23:37.404 DEBUG service: Successfully released execution lock daemon.go:93

Check the system logs. They might be getting killed by the OOM killer because there is not enough memory in the system.
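
For example, something along these lines should surface any OOM kills (this assumes the stock Ubuntu log locations; adjust the paths if your instance logs elsewhere):

# Kernel ring buffer, with human-readable timestamps
dmesg -T | grep -i -E 'oom|killed process'

# The same messages as persisted by rsyslog on Ubuntu
grep -i -E 'oom|killed process' /var/log/syslog /var/log/kern.log

# On systemd-based systems the kernel messages are also in the journal
journalctl -k | grep -i -E 'oom|killed process'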

Hello,

There is nothing in syslog or kern.log that suggests they were killed by the OOM killer.

Hello, the cluster process is shutting down in an orderly fashion. That only happens when it has received a signal telling it to do so (SIGINT, SIGTERM, or SIGHUP), the same as pressing Ctrl-C in the shell.

I suggest you look further into your system, as by all indications this looks like an external cause.
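
If you want to confirm which process is sending the signal, something like the following sketch may help. The systemd unit names below are just placeholders for however you run the daemons, and the auditd rule logs every kill() syscall, so expect some noise:

# If the daemons run under systemd, the journal records who stopped them
journalctl -u ipfs.service -u ipfs-cluster.service --since "1 hour ago"

# To catch an unknown sender, audit every kill() syscall for a while
sudo apt-get install -y auditd
sudo auditctl -a always,exit -F arch=b64 -S kill -k sigtrace
# ...wait for the next shutdown, then check which PID/command sent the signal
sudo ausearch -k sigtrace -i | less
# Remove the rule again afterwards, as it can be chatty
sudo auditctl -d always,exit -F arch=b64 -S kill -k sigtrace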