Here’s an update on the set of KPIs we’re leaning towards, after discussing with several people and teams. Feel free to add or contribute further to the discussion.
Timeline
- Decide on final KPIs on 24th March (1 week from today)
- Start reporting the majority of those by the first week of April
- Present and get further feedback at IPFS Thing 2023 (15th-19th April)
- Revise and finalise in May.
KPIs we’re leaning towards
Network Size & Stability
- Overall number of unique peers seen in the network (currently by bootstrapper + preload nodes)
- Unique number of DHT Server nodes
- PeerIDs and Unique IP Addresses seen
- Unique number of DHT Client nodes
- Stability of DHT Server nodes
- Classification:
  - Online: > 80% of time seen online
  - Mostly Online: 40% < x < 80% of time seen online
  - Mostly Offline: 10% < x < 40% of time seen online
  - Offline: < 10% of time seen online
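To make the stability classes concrete, here is a minimal Python sketch of how a node's observed uptime could be mapped to a class. The names `Stability` and `classify_stability` are hypothetical, and the thresholds above leave the exact boundary values (10%, 40%, 80%) unassigned, so the sketch assumes each boundary falls into the lower class.

```python
from enum import Enum


class Stability(Enum):
    ONLINE = "Online"
    MOSTLY_ONLINE = "Mostly Online"
    MOSTLY_OFFLINE = "Mostly Offline"
    OFFLINE = "Offline"


def classify_stability(online_fraction: float) -> Stability:
    """Map the fraction of time a DHT Server node was seen online to a class.

    Assumption: exact boundary values (0.1, 0.4, 0.8) are not assigned in the
    post; here each boundary is placed in the lower class.
    """
    if not 0.0 <= online_fraction <= 1.0:
        raise ValueError("online_fraction must be between 0 and 1")
    if online_fraction > 0.8:
        return Stability.ONLINE
    if online_fraction > 0.4:
        return Stability.MOSTLY_ONLINE
    if online_fraction > 0.1:
        return Stability.MOSTLY_OFFLINE
    return Stability.OFFLINE
```

For example, a node seen online in 19 out of 20 crawls would be passed in as `classify_stability(19 / 20)`.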
Performance
- DHT Latency - random content
- Publication Latency: Time to PUT/Provide
- Lookup Latency: Time to First Provider Record
- Fetch Latency
- Time to First Byte (TTFB)
- Time to Last Byte (TTLB)
- e2e Latency:
- Sum of DHT Latency + Fetch Latency
- e2e Error Rate
- Percentage of requests for which delivery of content failed
- Report the “leg” of the process in which the failure occurred, e.g., Provider Record Discovery failure vs content fetch failure
- Website Load Latency
- Latency to load sample websites through browser (using PL websites for now)
- Some or all of the above need to be measured for both long-running and short-running nodes
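As a rough illustration of how the e2e metrics could be derived from per-request measurements, here is a small Python sketch. The `Request` record, its field names and the leg labels are assumptions made for the example, not the format used by our measurement tooling.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Request:
    # Hypothetical record of one measured retrieval.
    lookup_latency_s: Optional[float]  # time to first provider record; None if discovery failed
    fetch_latency_s: Optional[float]   # fetch latency (TTFB or TTLB); None if the fetch failed


def failed_leg(req: Request) -> Optional[str]:
    """Return the leg on which the request failed, or None if it succeeded."""
    if req.lookup_latency_s is None:
        return "provider_record_discovery"
    if req.fetch_latency_s is None:
        return "content_fetch"
    return None


def e2e_latency(req: Request) -> Optional[float]:
    """e2e latency = DHT lookup latency + fetch latency (None for failed requests)."""
    if failed_leg(req) is not None:
        return None
    return req.lookup_latency_s + req.fetch_latency_s


def error_report(requests: List[Request]) -> Tuple[float, Dict[str, int]]:
    """Return (e2e error rate in percent, number of failures per leg)."""
    legs = Counter(failed_leg(r) for r in requests)
    legs.pop(None, None)  # drop the successful requests
    total = len(requests)
    rate = 100.0 * sum(legs.values()) / total if total else 0.0
    return rate, dict(legs)
```

Keeping the failed leg next to each request means the error rate and the per-leg breakdown come from the same data, so the two numbers cannot drift apart.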
Traffic
- Number of requests to the public, PL-operated IPFS Gateways
- Number of requests received, through Bitswap and through the DHT, at specific vantage points we control.
- Number of provider records published to the network
- Number of unique CIDs seen through the Gateways
Abnormalities (could also be Health)
Examples include:
- Increased number of rotating PeerIDs
- Unusually high *-latency
- Increased number of unresponsive nodes, or errors (as reported at: https://github.com/protocol/network-measurements/tree/master/reports/2023/calendar-week-10/ipfs#errors)
- Increased number of nodes from some geographic location, or from one ASN/ISP.
We should define alert levels for each one of these. For instance:
- Good Health. No abnormalities.
- Functioning Normally. E.g., in case of increased number of Rotated PeerIDs
- Concerning Signs/Investigation Needed. E.g., increased number of nodes from a particular geographic location
- Red Alert: Increased number of unresponsive nodes. Performance disrupted.
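One possible way to encode these levels is sketched below in Python. The signal names are hypothetical, the signal-to-level mapping simply mirrors the examples above, and reporting the most severe active level as the overall status is an assumption, not something we have decided.

```python
from enum import IntEnum
from typing import Iterable


class AlertLevel(IntEnum):
    GOOD_HEALTH = 0
    FUNCTIONING_NORMALLY = 1
    INVESTIGATION_NEEDED = 2
    RED_ALERT = 3


# Mapping taken from the examples in the post; the signal names are illustrative.
SIGNAL_LEVELS = {
    "rotating_peer_ids": AlertLevel.FUNCTIONING_NORMALLY,
    "geographic_concentration": AlertLevel.INVESTIGATION_NEEDED,
    "unresponsive_nodes": AlertLevel.RED_ALERT,
}


def overall_level(active_signals: Iterable[str]) -> AlertLevel:
    """Return the most severe level among the active signals.

    Assumption: the overall status is the maximum severity; with no active
    signals the network is reported as Good Health.
    """
    return max(
        (SIGNAL_LEVELS[s] for s in active_signals),
        default=AlertLevel.GOOD_HEALTH,
    )
```

The actual thresholds that decide when a signal counts as "increased" would still need to be defined per metric.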
Developer Activity
[More to be added by @lidel and @guseggert]
- GitHub Activity in IPFS-related GitHub Orgs
- GitHub Activity in ipfs/specs
IMPORTANT NOTE
These are the high-level KPIs, primarily targeting one component of the IPFS system, i.e., the DHT. If you’re interested in lower-level metrics for your application or implementation project, please bring them up. We do gather, and will be reporting, lots of lower-level, protocol-specific metrics. The current set of metrics we’re looking at can be found in our weekly reports, under reports/2023 in the protocol/network-measurements repository on GitHub.
Furthermore, we’ll be developing “project-wide KPIs” to get a broader view of the network - this will come as a separate post later in the year.