Node announcing too slowly to DHT

As I understand it, DHT records for blocks have a 1-day TTL. My node is not able to finish a reprovider run within one day.
It would be helpful to announce pin roots first, then MFS directories, then file roots, and then the rest of the blocks. If the provider runs for over 1 day, stop it and start over in the same order.
There is an option to set the strategy to something like “roots” instead of the default “all”, but it did not announce enough to keep my app working.
I don’t need to announce blocks inside files; my app accesses entire files only. A temporary fix would be to make the DHT record lifetime longer.
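
To illustrate the ordering proposed above, here is a minimal Go sketch. It is not how the go-ipfs reprovider works; the `Kind` classes, the `Item` type, and the `budget` parameter (how many announcements fit before the records expire) are hypothetical names made up for this example.

```go
package main

import (
	"fmt"
	"sort"
)

// Kind is a hypothetical priority class; lower values are announced first.
type Kind int

const (
	PinRoot Kind = iota
	MFSDir
	FileRoot
	OtherBlock
)

// Item pairs a CID (a plain string here) with its priority class.
type Item struct {
	CID  string
	Kind Kind
}

// announceOrder returns a copy of items sorted by priority class.
// The sort is stable, so items of the same class keep their input order.
func announceOrder(items []Item) []Item {
	out := append([]Item(nil), items...)
	sort.SliceStable(out, func(i, j int) bool { return out[i].Kind < out[j].Kind })
	return out
}

// withinBudget returns the prefix of the ordered list that fits into budget
// announcements. When the budget runs out, the next run starts again from
// the top, so the most important records are refreshed first.
func withinBudget(items []Item, budget int) []Item {
	ordered := announceOrder(items)
	if budget < 0 {
		budget = 0
	}
	if budget > len(ordered) {
		budget = len(ordered)
	}
	return ordered[:budget]
}

func main() {
	items := []Item{
		{CID: "block-1", Kind: OtherBlock},
		{CID: "file-root-1", Kind: FileRoot},
		{CID: "pin-root-1", Kind: PinRoot},
		{CID: "mfs-dir-1", Kind: MFSDir},
		{CID: "block-2", Kind: OtherBlock},
	}
	for _, it := range withinBudget(items, 3) {
		fmt.Println(it.CID)
	}
}
```

With a budget of 3, only the pin root, the MFS directory, and the file root are announced; the plain blocks are the first thing dropped when time runs out.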

Use the accelerated DHT client.
I have 7ms of publishing time for 1 million+ CIDs.

Note: unlike the default bucket-based DHT client, the accelerated DHT client just goes faster the more peers you have.
1000+ peers is a good thing with it.
I run with 7000 peers most of the time (that’s a bit extreme, though); 2000 low / 3000 high is likely good enough for most people.

I don’t know if the default provider does that; that would be smart.
However, if you wish, you can activate a mode that does only those things.
See this option: https://github.com/ipfs/go-ipfs/blob/master/docs/config.md#reproviderstrategy

Note: there are big performance improvements from bundling keys that go to the same peer.
If you dump all CIDs onto the DHT client, many keys will be stored by the same nodes (as there are only ~11k public nodes). What the accelerated DHT client does is bundle the keys that go to the same peers and send them all at once; doing things incrementally requires dialling the same peers over and over and over.
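
A rough Go sketch of the bundling idea described above. It is only an illustration, not the accelerated DHT client's actual code: the key-to-peers mapping is passed in as plain data (a real client would derive it from DHT distance), and `groupByPeer` and `dialCounts` are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
)

// groupByPeer inverts a key -> responsible-peers mapping into
// peer -> keys, so each peer can be contacted once with its whole batch.
func groupByPeer(closest map[string][]string) map[string][]string {
	batches := make(map[string][]string)
	for key, peers := range closest {
		for _, p := range peers {
			batches[p] = append(batches[p], key)
		}
	}
	// Map iteration order is random; sort for deterministic batches.
	for p := range batches {
		sort.Strings(batches[p])
	}
	return batches
}

// dialCounts compares how many dials are needed when keys are sent one at
// a time (one dial per key/peer pair) versus bundled (one dial per peer).
func dialCounts(closest map[string][]string) (incremental, bundled int) {
	for _, peers := range closest {
		incremental += len(peers)
	}
	return incremental, len(groupByPeer(closest))
}

func main() {
	// Toy data: three keys that mostly land on the same peers.
	closest := map[string][]string{
		"cid-a": {"peer-1", "peer-2"},
		"cid-b": {"peer-1", "peer-3"},
		"cid-c": {"peer-1", "peer-2"},
	}
	batches := groupByPeer(closest)
	peers := make([]string, 0, len(batches))
	for p := range batches {
		peers = append(peers, p)
	}
	sort.Strings(peers)
	for _, p := range peers {
		fmt.Println(p, batches[p])
	}
	inc, bun := dialCounts(closest)
	fmt.Println("dials incremental:", inc, "bundled:", bun)
}
```

In this toy input, six incremental dials collapse into three bundled ones. The gap widens as the number of keys grows while the number of peers stays fixed.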

A publishing time of 7 ms for 1M CIDs is a lie.

I suspect this is a miscommunication. @Jorropo, did you mean that when providing millions of CIDs the average time per provide was around 7ms? If so, that’s quite plausible. I’ve seen nodes with tens of millions of CIDs with averages in the tens to hundreds of microseconds.

The difference between a single operation not plausibly taking 7ms and a per-CID average across many operations being so much lower basically comes from lowering the fixed costs (forming connections and DHT routing table lookups) and sending the operations together (e.g. 7ms to download a 1KiB file from another country would sound nuts, but an average of 1KiB/7ms is about 143 KiB/s, which is very doable :grinning_face_with_smiling_eyes:).

1 million * 7ms = 1.94 hours, although as mentioned I’ve seen much faster than 7ms per CID as the number of CIDs being advertised scales.

I misunderstood:

$ ipfs config show | jq .Experimental.AcceleratedDHTClient
true
$ ipfs stats provide
TotalProvides:          142k (142,912)
AvgProvideDuration:     4.913ms
LastReprovideDuration:  11m42.172876s
LastReprovideBatchSize: 142k (142,912)

The AvgProvideDuration is in fact totalTime / numberOfCIDs. I thought it was the average of all provide operations since the node started running, my bad.
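
The relationship between the per-CID average and the total run time can be checked with a few lines of Go. This is just arithmetic on the figures quoted in the thread; `totalFromAverage` is a hypothetical helper, and the average is passed in microseconds because `ipfs stats provide` reports fractional milliseconds.

```go
package main

import (
	"fmt"
	"time"
)

// totalFromAverage estimates the wall-clock time of a provide run from the
// per-CID average (in microseconds) and the number of CIDs:
// total = average * count.
func totalFromAverage(avgMicros, cids int64) time.Duration {
	return time.Duration(avgMicros*cids) * time.Microsecond
}

func main() {
	// Figures quoted in the thread.
	fmt.Println(totalFromAverage(4913, 142912))  // 4.913ms average, 142,912 CIDs
	fmt.Println(totalFromAverage(7000, 1000000)) // 7ms average, 1 million CIDs
}
```

The first line comes out at roughly 11m42s, which matches the LastReprovideDuration shown above. The second is 1h56m40s, i.e. the 1.94 hours mentioned earlier.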

A publishing time of 7 ms for 1M CIDs is a lie.

If you have a list of all nodes and don’t mind starting 11k threads and killing your CPU for a few seconds, you can just send the CID to the right node in one shot and do that on all of them in parallel. Then you could see 250ms or 500ms times (which, OK, isn’t 7ms), which is on the order of the time a signal needs to go around the globe. (Yes, IPFS doesn’t do that for now, but that’s not unrealistic.)

10-minute batches are too long. And 11k threads would kill the OS; it needs to be non-blocking IO in a few threads, using something like the Java NIO API for multiplexing.

Why?
10 minutes twice per day sounds perfectly reasonable to me.
Also remember that’s for a full publish.
If you just added a new file, you can publish only that file, which takes almost no time.

Sounds awfully similar to golang’s work stealing too.
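
For context on the point above: goroutines are multiplexed by the Go runtime onto a small number of OS threads, so "11k threads" in Go does not mean 11k kernel threads. A minimal sketch of capping in-flight sends with a semaphore follows. The `send` callback and `errUnreachable` are stand-ins invented for this example, not part of any IPFS API.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"sync/atomic"
)

// errUnreachable is a placeholder error for the fake send below.
var errUnreachable = errors.New("peer unreachable")

// sendAll calls send once per peer with at most limit calls in flight and
// returns the number of failed sends.
func sendAll(peers []string, limit int, send func(peer string) error) int {
	if limit < 1 {
		limit = 1
	}
	sem := make(chan struct{}, limit) // counting semaphore
	var wg sync.WaitGroup
	var failed int64
	for _, p := range peers {
		sem <- struct{}{} // blocks while limit sends are in flight
		wg.Add(1)
		go func(p string) {
			defer wg.Done()
			defer func() { <-sem }()
			if err := send(p); err != nil {
				atomic.AddInt64(&failed, 1)
			}
		}(p)
	}
	wg.Wait()
	return int(failed)
}

func main() {
	peers := make([]string, 11000)
	for i := range peers {
		peers[i] = fmt.Sprintf("peer-%d", i)
	}
	// Fake send: every 1000th call fails, the rest succeed immediately.
	var calls int64
	failed := sendAll(peers, 256, func(peer string) error {
		if atomic.AddInt64(&calls, 1)%1000 == 0 {
			return errUnreachable
		}
		return nil
	})
	fmt.Println("failed sends:", failed)
}
```

The limit of 256 is arbitrary; a real client would tune it against available bandwidth and file descriptors.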

It’s only once per day for you because all your data fits into one batch. With more data you get many more 10-minute batches per day, and you simply can’t commit to 10 minutes of downtime every now and then. It would be more manageable to split it into batches of about 30 seconds, because it doesn’t only kill IPFS; it also completely kills other tasks running on the same machine.

To make 10-minute batches work better, the start time needs to be configurable so that nodes in a cluster don’t all announce at the same time and stop serving files. One runs on the 10th minute and then waits for 1 hour; the second runs on the 30th minute.
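
A small Go sketch of the staggering idea, assuming evenly spaced start offsets within a repeating period. The thread describes this as a wish, not an existing go-ipfs setting, so `staggeredStarts` and its parameters are purely hypothetical.

```go
package main

import "fmt"

// staggeredStarts spreads the batch start of each node evenly across a
// period, returning one start offset (in minutes) per node.
func staggeredStarts(nodes, periodMinutes int) []int {
	if nodes <= 0 || periodMinutes <= 0 {
		return nil
	}
	offsets := make([]int, nodes)
	for i := range offsets {
		offsets[i] = i * periodMinutes / nodes
	}
	return offsets
}

func main() {
	// Three cluster nodes, each announcing once per hour.
	for node, minute := range staggeredStarts(3, 60) {
		fmt.Println("node", node, "starts at minute", minute)
	}
}
```

With three nodes and a 60-minute period, the batches start 20 minutes apart, so a 10-minute batch on one node never overlaps with a batch on another.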

How many CIDs do you have though? And what are the specs of your machine?

It may be that you need a better machine, better bandwidth, or better settings for what you are trying to do. You should try the accelerated DHT client. If that freezes your computer, then that likely indicates you need more CPU/RAM/bandwidth/disk for the amount of content you want to provide.

I just tested it on Windows 10 with 30 GB of data and the new DHT code. Max connections is configured to 700, peaking at 3500. RAM used is about 3.5 GB. It runs batches about 10 minutes long, and the computer is very unresponsive during those 10 minutes. Setting the ipfs process priority to the lowest level makes no difference; there are windows of about 10 seconds when the mouse does not even move.
Nobody will run this on a personal machine like people do with torrents. Offloading to an RPi is the only realistic choice. You can’t ship software like this with browsers, like Brave is doing.
It’s a very light test: 30 GB of data is nothing.

What about a scheduler that properly follows priority (Linux *cough cough*)?
I know saying “it works for me” or “just use Linux” isn’t helpful, but I have ~3k connections all the time, IPFS is niced (deprioritized) and running on the same PC I use, and it’s as responsive as if IPFS wasn’t running (note that this is not subjective; I’m measuring it with the idle jitter, which stays very consistent at 60µs).

How many CIDs are you advertising (note: 30GB isn’t that helpful since you could be using different provider strategies, and there’s no indication as to the average block size)? What does your config look like? What does your resource utilization look like during the period you’re experiencing as unresponsive?

Without more information the only plausible responses you’re going to get here are:

  1. Must be you, because it’s fine on my machine (Jorropo’s comment above and my Windows setup even without tweaking prioritization)
  2. Mine has problems too!
  3. Yes, there’s an experimental feature known to use a bunch of resources in a spiky pattern. Work to improve things such that there is a client stable enough to be run by default that is both fast and efficient will happen when someone gets around to it since there are plenty of feature requests and bug reports to go around. If you’re interested in helping out, reach out on GitHub.

total provides 976k
LastReprovideBatchDuration 15m