I’m currently experimenting with IPFS Cluster to assess its suitability for constrained environments, such as those with intermittent connectivity, low bandwidth, and high latency. I have set up three Linux VMs on the same PC, each running IPFS, ipfs-cluster-ctl, and ipfs-cluster-service, and they all share the same cluster secret.
I have a few questions:
I attempted to modify the service.json file to reduce the frequency of heartbeat messages (I’ve pasted the kind of change I tried below these questions). However, after restarting the ipfs-cluster-service daemon, I did not observe any change in activity frequency when monitoring with Wireshark. Could anyone provide guidance on the correct way to make this adjustment?
Is there a way to reduce the overall bandwidth usage of IPFS through the configuration file, aside from altering the frequency of heartbeat messages?
Are there any existing research papers that investigate how the number of nodes in an IPFS Cluster affects the bandwidth requirements for normal operation?
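For reference, this is roughly the edit I made to service.json (values raised from the defaults, with the rest of the file omitted). I’m not certain these are even the right keys, which is partly why I’m asking:

```json
{
  "cluster": {
    "monitor_ping_interval": "60s"
  },
  "monitor": {
    "pubsubmon": {
      "check_interval": "60s"
    }
  }
}
```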
I appreciate any insights or advice you can provide!
Hi @Jiajunn, just thought I’d mention the Traffic Control (tc) tool in Linux if you’re not already familiar with it. It can be helpful for limiting bandwidth, and introducing jitter and loss to a network interface. You can read more about it in the man pages for tc, and the following online reference is also a decent one: https://tldp.org/HOWTO/Traffic-Control-HOWTO/
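For example, something along these lines (eth0 and the numbers are just placeholders) adds delay with jitter plus some random packet loss, and removes it again afterwards:

```sh
# add 100ms delay with ±20ms jitter and 1% packet loss on eth0
sudo tc qdisc add dev eth0 root netem delay 100ms 20ms loss 1%

# remove the qdisc again when you are done testing
sudo tc qdisc del dev eth0 root
```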
Hi @cewood , thanks for the tip! I’ve actually been experimenting with the Traffic Control (tc) tool in Linux to test the IPFS Cluster. I noticed that when setting the throughput to 22Kbps, with a 5k burst and 5k limit, the peer loses connection to the cluster. I was wondering if there are any configuration options or methods to reduce the IPFS Cluster’s bandwidth usage so it can function in such an environment, or if modifying the source code would be necessary.
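For reference, the shaping I applied looks roughly like this (eth1 is just a placeholder for my VM’s interface):

```sh
# throttle the interface to ~22 kbit/s with a 5k burst and 5k limit
sudo tc qdisc add dev eth1 root tbf rate 22kbit burst 5k limit 5k
```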
Hello @Jiajunn … leaving out Kubo bandwidth usage (DHT, Bitswap retrievals, etc.), ipfs-cluster-specific bandwidth corresponds mostly to pubsub and correlates heavily with the number of peers in the cluster.
In the last stable version an ipfs-cluster-ctl health bandwidth command was added, so you can check for yourself.
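For example (assuming you are on a release recent enough to include the subcommand):

```sh
# report bandwidth usage as seen by this cluster peer
ipfs-cluster-ctl health bandwidth
```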
The bandwidth usage comes down to how each pubsub peer broadcasts a message and how long it takes for that message to reach all the peers. The most efficient approach is to broadcast from one peer to everyone at once, but as soon as your number of peers is in the thousands, that stops being an option. Heartbeat intervals were increased, among other things, but in the end the pubsub settings should be optimized for the characteristics of the cluster.
Hi @hector, appreciate the help! I will look into tuning the pubsub configuration values. I’ve now been using tc qdisc add dev [interface] root tbf rate [rate] burst [burst] latency [latency] to limit the throughput of my IPFS Cluster node and to investigate the throughput at which it disconnects from the rest of the cluster. Currently, I check for disconnection by running ipfs-cluster-ctl peers ls and looking for a context error in the output (since that means the request timed out). Is there a more straightforward way to check whether a node is disconnected from the rest of the cluster?
It is a bit more complicated… cluster peers form a swarm, and not every peer needs to be connected to every other. They do, however, publish “metrics” over pubsub, and every peer should receive the metrics from all the others. Most of the improvements I mentioned were around optimizing this to reduce bandwidth requirements, so that in the end the baseline is quite moderate even on large clusters.
(There’s a balance to strike between how many direct connections to keep open vs. how many re-broadcasts of the same message pubsub needs to cover the whole swarm, and it depends heavily on the number of peers.)
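A quick way to see this in practice (if I remember the subcommand correctly) is to ask a peer which “ping” metrics it currently holds; every live peer should show up there:

```sh
# list the latest "ping" metrics this peer has received from the others
ipfs-cluster-ctl health metrics ping
```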
As such, there is no concept of “a peer has disconnected from the cluster” but rather a notion of “peer metrics are not arriving”. If they do not arrive, other peers should print “alerts” in the logs when the latest known metrics expire and are not renewed. ipfs-cluster-ctl health alerts also shows the latest expired metrics, which would correspond to peers having connectivity issues.
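So, for your test, something like this is probably a more direct check than parsing the peers ls output (exact output may vary between versions):

```sh
# show expired metrics, i.e. peers whose heartbeats stopped arriving
ipfs-cluster-ctl health alerts
```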
In every peer, the current “peerset” is made of the list of peers from which we have received “ping” metrics (and which have not expired). ipfs-cluster-ctl peers ls takes that list and opens direct connections to retrieve peer information from every peer in that “peerset”. A context error may mean that the peer is down, or it may mean that that particular, previously nonexistent direct connection to the peer could not be opened. But if the peer in question had no “live” metrics, it would not have been contacted at all (so the cluster peer from which you run the command would report a lower total number of peers than there actually are).
In the specific case of a context error, some peers are sometimes just slow to set up connections, and increasing dial_peer_timeout (default 3s) in the configuration helps.
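That is, something along these lines in service.json (rest of the file omitted; 10s is just an example value):

```json
{
  "cluster": {
    "dial_peer_timeout": "10s"
  }
}
```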