We have an IPFS node (running in Docker) that has JSON objects (as files) pinned, along with DAG blocks that form a long DAG chain. Roughly 750 JSON files are pinned every 2 minutes, and similarly ~750 DAG blocks are created every 2 minutes. Right now this node has about 4 million pins.
We are noticing a constant increase in RAM utilization over days of running; eventually ipfs ends up using the entire RAM and gets OOM-killed, and there is no recovery from there. Even if we restart the container, it soon takes up all the RAM again.
The attached screenshot shows the RAM utilization increase over the past week.
We need help figuring out what could be causing this slow increase in RAM utilization and what can be done to reduce it.
How does go-ipfs determine what needs to be kept in memory and which data to flush out? Does it use some kind of LRU caching mechanism based on the last access time of a CID?
Does it depend on the IPLD data structures used in the DAG blocks? (We have long DAG chains forming over time, where each new block has a link to the previous one along with a data snapshot CID.)
Note: on a different node that also has many millions of pins, we unpinned a lot of data to see whether this affects RAM utilization (which was close to 2.5GB) and noticed that there was no impact.
In particular, try enabling AcceleratedDHTClient. Are your pins all recursive? If you have a very long DAG and each item in that chain is pinned recursively, that's probably not a good pattern. You may also want to change the Reprovider Strategy to "roots" (kubo/config.md at master · ipfs/kubo · GitHub).
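For reference, both settings can be applied with `ipfs config` (a sketch; check the keys against your version's config.md — in older go-ipfs releases the accelerated DHT client lives under `Experimental`, in recent Kubo it moved to `Routing.AcceleratedDHTClient`):

```shell
# Keep a full DHT routing table so that reproviding a large pinset
# is much faster (at the cost of extra RAM/CPU around startup).
ipfs config --json Experimental.AcceleratedDHTClient true

# Announce only the root CIDs of pinned DAGs instead of every block.
ipfs config Reprovider.Strategy roots

# Restart the daemon for the changes to take effect.
```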
Thanks @hector, will try these configurations and get back.
Each item is pinned recursively, but the files (i.e. the snapshots) don't have links to any other object and are added at the root /ipfs only.
The DAGs are becoming very long, though, and each block is pinned. We are thinking of adding a pruning mechanism to prevent the DAGs from growing beyond 7 days. Hopefully that will help with this RAM issue.
Just out of curiosity, what strategy does go-ipfs use to determine which items should be in memory and which only on disk? Does it have anything to do with pinned/unpinned content?
It's not that things are either in memory or on disk. Are you using Badger? If so, Badger might be the culprit. Otherwise, perhaps your nodes cannot reprovide fast enough and goroutines are queuing up, causing memory creep.
We are using the default datastore, which is flatfs.
OK, is there any way to determine whether reproviding delay is causing this memory creep?
Also, I will play with the reprovider config to see if something helps.
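One way to check (a sketch, assuming the API is on the default port 5001; `ipfs stats provide` may require the accelerated DHT client to be enabled, depending on your version):

```shell
# If LastReprovideDuration approaches the reprovide interval
# (12h by default), the node cannot keep up with announcing all pins.
ipfs stats provide

# A steadily growing goroutine count is another signal that work
# (e.g. provide operations) is piling up; the first line is the total.
curl -s "http://127.0.0.1:5001/debug/pprof/goroutine?debug=1" | head -n 1
```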
On one of the nodes, where the memory used by the IPFS process is ~4.1GB, I took a heap dump and generated an SVG from it.
Interestingly, when I open the heap dump in pprof and run the top command, it only shows 1.7GB of allocations (whereas the ipfs process is using close to 4.1GB of RAM according to top).
Does that mean the remaining 2.3GB is simply not released back to the OS and is not actually used by ipfs?
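For anyone wanting to reproduce this, the profile can be collected from the daemon's debug endpoint (a sketch, assuming the API on the default port 5001 and a local Go toolchain):

```shell
# Grab a heap profile from the running daemon.
curl -s -o heap.pprof http://127.0.0.1:5001/debug/pprof/heap

# -inuse_space shows live allocations; -alloc_space shows cumulative ones.
go tool pprof -top -inuse_space heap.pprof
go tool pprof -svg heap.pprof > heap.svg
```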
I am unable to upload the SVG here, so I am pasting the top usages instead:
```
(pprof) top
Showing nodes accounting for 1.50GB, 87.73% of 1.71GB total
Dropped 482 nodes (cum <= 0.01GB)
Showing top 10 nodes out of 152
      flat  flat%   sum%        cum   cum%
    0.99GB 57.66% 57.66%     0.99GB 57.66%  strings.(*Builder).grow
    0.36GB 21.02% 78.68%     0.36GB 21.02%  github.com/ipfs/go-bitswap.(*Bitswap).provideCollector
```
I was going through the config mentioned here; it suggests setting the sync flag to false in the flatfs config. I could not find any documentation for this flag. Could you brief me on how it would improve performance?
I forgot to mention that Go will grab memory from the OS and not return it unless politely asked to. This makes it seem that a lot of memory is in use, when it can actually be reclaimed by the OS at any time. A better number can be obtained from the Prometheus metrics exported by the daemon; those show the actual amount allocated.
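The relevant numbers can be pulled from the daemon's metrics endpoint (a sketch; these are the standard Go runtime metrics exposed by the Prometheus client):

```shell
# go_memstats_alloc_bytes: memory actually in use by live Go objects.
# go_memstats_sys_bytes: what the runtime has obtained from the OS.
# heap_idle/heap_released: memory held by Go but reclaimable by the OS.
curl -s http://127.0.0.1:5001/debug/metrics/prometheus \
  | grep -E 'go_memstats_(alloc|sys|heap_idle|heap_released)_bytes'
```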
It's in the datastores.md documentation file in the Kubo repo. Sync=false will not commit to disk after every write: much faster, but more risk of losing data in case of a sudden power-off.
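For reference, the flag sits on the flatfs mount inside `Datastore.Spec` in the repo's config file; a trimmed sketch of the relevant fragment with sync disabled (the surrounding mounts are omitted):

```json
{
  "Datastore": {
    "Spec": {
      "mounts": [
        {
          "child": {
            "path": "blocks",
            "shardFunc": "/repo/flatfs/shard/v1/next-to-last/2",
            "sync": false,
            "type": "flatfs"
          },
          "mountpoint": "/blocks",
          "prefix": "flatfs.datastore",
          "type": "measure"
        }
      ]
    }
  }
}
```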
Oh, in that case we should not have reached OOM, right?
We are running go-ipfs via Docker on an AWS instance. Could this environment have caused it to reach OOM?
Also, is there any flag that can be enabled to force Go to release memory back to the OS (based on unallocated memory)?
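There is no kubo flag for this as far as I know, but the Go runtime itself can be steered via environment variables (a sketch; behaviour is version-dependent):

```shell
# Run garbage collection more aggressively than the default GOGC=100,
# trading CPU time for a smaller heap.
GOGC=50 ipfs daemon

# Binaries built with Go 1.12-1.15 return memory with MADV_FREE, so
# top keeps counting it as used; this restores the eager MADV_DONTNEED
# behaviour (which is the default again from Go 1.16 onward).
GODEBUG=madvdontneed=1 ipfs daemon
```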
I missed your screenshot. Have you tried the optimized Bitswap settings mentioned in Download and setup - Pinset orchestration for IPFS? You need to tune them to the size of the machine, but maybe right now Bitswap is bottlenecked and things are queuing up.
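If your version exposes the `Internal.Bitswap` section in config.md, the tuning looks roughly like this (a sketch with illustrative values only, not a recommendation; tune to your machine and verify the keys exist in your release):

```shell
# Worker counts for serving and fetching blocks; larger machines can
# afford more workers, at the cost of memory and blockstore pressure.
ipfs config --json Internal.Bitswap.TaskWorkerCount 16
ipfs config --json Internal.Bitswap.EngineTaskWorkerCount 16
ipfs config --json Internal.Bitswap.EngineBlockstoreWorkerCount 256

# Cap on bytes queued per peer, so a few slow peers cannot pile up work.
ipfs config --json Internal.Bitswap.MaxOutstandingBytesPerPeer 1048576
```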