I’ve been experimenting with pinning a large number of objects. I was wondering if there are any limits on how many objects can feasibly be pinned to a single kubo node?
In my tests I found that the pin/ls API returns about 2.5 million pins before the connection breaks, so that would be an issue. Let’s say I wanted to support 10 million pins, what other issues might I be running into?
For the sake of discussion, assume that I have unlimited storage, so I’m only concerned about the active management. In particular, I’m trying to understand if there are any internal processes that scale linearly with the number of pins. One such candidate I already found is the garbage collection, but are there other (important?) processes I need to be mindful of?
I’m guessing providing would also be impacted by this. How many pins can a node on a fast network with accelerated DHT (re-)provide per second?
Also, I know that ipfs-cluster is a thing, but even then I’d be curious to understand how many nodes I would need to support X million pins.
My (limited) understanding is that rainbow is acting purely as an HTTP gateway. I need the files to be advertised on the DHT to be accessible through bitswap.
That’s very likely another avenue we are exploring, but we’d like the files to be discoverable through DHT as well.
We had a 100M pins deployment with 24 ipfs-cluster peers. In general, it all depends on how fast your disk is. I’m not sure why the connection to Kubo would break at 2.5M pins, but yeah, you need to test, because 1) your hardware (particularly disks), 2) your configuration, 3) the amount of retrievals/traffic, and 4) the amount of writes all affect the final number of what becomes “too much” for a single Kubo box to handle.
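On the pin/ls disconnect: one thing that might be worth testing is the streaming mode of the pin listing, which emits pins one by one instead of building a single large response. Below is a minimal sketch, assuming the Kubo RPC API listens on the default `127.0.0.1:5001` and that `pin/ls` with `stream=true` returns one JSON object per line with `Cid` and `Type` fields. The helper names `count_pins` and `stream_pin_counts` are made up for illustration, and I have not verified that streaming avoids the break at 2.5M pins.

```python
import json
import urllib.request
from collections import Counter
from typing import Iterable


def count_pins(lines: Iterable[bytes]) -> Counter:
    """Tally pins by type from newline-delimited JSON, one object per line."""
    counts: Counter = Counter()
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines between objects
        obj = json.loads(raw)
        counts[obj.get("Type", "unknown")] += 1
    return counts


def stream_pin_counts(api: str = "http://127.0.0.1:5001") -> Counter:
    """Count recursive pins without holding the whole listing in memory.

    Assumes a local Kubo RPC endpoint; the API address is an assumption.
    """
    url = f"{api}/api/v0/pin/ls?type=recursive&stream=true"
    req = urllib.request.Request(url, method="POST")  # the Kubo RPC API expects POST
    with urllib.request.urlopen(req) as resp:
        # Iterating the response yields one line (one pin) at a time.
        return count_pins(resp)
```

Calling `stream_pin_counts()` against a running node should return a per-type count. Because it only keeps counters rather than the full list, memory use on the client side stays flat regardless of the number of pins.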
Thanks, that’s some very helpful data. I think my main concern was around how many (re)provides a single Kubo node can feasibly handle. In your example, at roughly 4M pins per node, that would be about 50 provides per second per node over a 22h window (assuming you only announce direct/recursive pins), which sounds good!
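For anyone who wants to redo this back-of-the-envelope estimate with their own numbers, here is a small sketch. It assumes provides are spread evenly over the window and that only pin roots are announced. The function names are hypothetical, and the node estimate ignores replication, so treat it as a lower bound.

```python
def provides_per_second(pins: int, window_hours: float) -> float:
    """Average announce rate needed to cover all pins once per window."""
    return pins / (window_hours * 3600)


def nodes_needed(total_pins: int, pins_per_node: int) -> int:
    """Ceiling division: nodes required if each holds at most pins_per_node.

    Ignores replication, so this is a lower bound (an assumption).
    """
    return -(-total_pins // pins_per_node)


# Numbers from the thread: ~4M pins per node, 22h window.
rate = provides_per_second(4_000_000, 22)
print(f"{rate:.1f} provides/s per node")  # roughly 50.5

# 10M pins at ~4M pins per node, no replication.
print(nodes_needed(10_000_000, 4_000_000), "nodes")  # 3
```

If every block is announced rather than just the roots, replace `pins` with the total block count, which can raise the required rate by orders of magnitude.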
Right, you can also skip providing to the DHT and use IPNI (cid.contact) instead. You are correct that reproviding to the DHT alone is a source of high CPU, bandwidth and disk usage at that level. I am not sure if our nodes were managing that correctly.
I see, makes sense. Thank you! While I have you here, I have a somewhat related question: is it enough to just announce the pins/roots? Will other nodes in the network know to ask for the remaining blocks over bitswap directly, or will I always need to advertise all individual blocks?