IPFS Metrics & KPIs

As part of a revamp of the way we look at metrics for the IPFS Network, I’d like to open the discussion on what KPIs would be useful to report for developers and the wider community. The metrics we’re converging towards as KPIs are listed below. Please feel free to comment on the list or suggest new ones that we can work towards.

Network Size & Stability

  • Overall number of unique peers seen in the network (currently as seen by PL-operated bootstrapper nodes)
    • Number of unique DHT Server nodes
      • PeerIDs and unique IP addresses seen
    • Number of unique DHT Client nodes
    • Stability of DHT Server nodes (see the sketch after this list)
      • Classification:
        • Online: > 80% of time seen online
        • Mostly Online: 40%–80% of time seen online
        • Mostly Offline: 10%–40% of time seen online
        • Offline: < 10% of time seen online
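As a minimal sketch, the classification above could be computed from the fraction of crawl rounds in which a peer was seen online (the function name and crawl cadence below are illustrative, not part of any existing tooling):

```typescript
type Stability = "Online" | "Mostly Online" | "Mostly Offline" | "Offline";

// fractionOnline: share of crawl rounds (0..1) in which the peer responded.
function classifyStability(fractionOnline: number): Stability {
  if (fractionOnline > 0.8) return "Online";
  if (fractionOnline > 0.4) return "Mostly Online";
  if (fractionOnline > 0.1) return "Mostly Offline";
  return "Offline";
}

// e.g. a peer seen in 30 of 48 half-hourly crawls (62.5%) falls into "Mostly Online"
console.log(classifyStability(30 / 48));
```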

Performance

  • DHT Lookup Latency - random content
    • Time to First Provider Record
  • Latency to load sample websites through a browser (using PL websites for now)
  • Future: e2e performance - random content
    • TTFB
    • TTLB

Traffic

  • Number of requests to the public, PL-operated IPFS Gateways
  • Future: Number of provider records published to the network
  • Future: Number of requests to specific vantage points we control through Bitswap and through the DHT.

Abnormalities

[Reported when needed]

  • Increased number of rotating PeerIDs
  • Increased number of unresponsive nodes
  • Increased number of nodes from some geographic location

Developer Activity

  • GitHub activity in IPFS-related GitHub orgs
  • GitHub activity in ipfs/specs

For visibility: the metrics above that are reported in the PL EngRes monthly all-hands are explained more thoroughly in Notion.

Time To Last Byte?

Yup. Clarifying for others too:

  • TTFB: Time To First Byte
  • TTLB: Time To Last Byte

Performance: Agree with TTFB and TTLB. I would also like to see performance related to adding a document to IPFS across several document sizes, with TTFB and TTLB. This would not be an HTTP gateway operation, but rather a direct add to a local IPFS node, perhaps through the Kubo interface (a rough sketch follows below). Perhaps also define the platform in terms of number of cores, number of threads, etc. The idea would be to give some indication of the expected performance and to get a view of scalability. Perhaps these KPIs already exist?
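To sketch what I mean, something along these lines against a local Kubo node would already produce the numbers (this assumes the default HTTP RPC API address and Node 18+; the document sizes are purely illustrative):

```typescript
// Time "add" operations against a local Kubo node for several document sizes.
// Assumes Kubo's HTTP RPC API at its default address and Node 18+ (global fetch/FormData).
const KUBO_API = "http://127.0.0.1:5001";
const SIZES = [1_000, 100_000, 10_000_000]; // bytes; purely illustrative

async function timeAdd(size: number): Promise<number> {
  const body = new FormData();
  body.append("file", new Blob([new Uint8Array(size)]), `sample-${size}.bin`);
  const start = performance.now();
  const res = await fetch(`${KUBO_API}/api/v0/add`, { method: "POST", body });
  await res.text(); // wait for the full response, i.e. until the add has completed
  return performance.now() - start;
}

async function main() {
  for (const size of SIZES) {
    console.log(`add of ${size} bytes took ${(await timeAdd(size)).toFixed(1)} ms`);
  }
}
main();
```

The same loop could be repeated on machines with different core/thread counts to get the scalability view mentioned above.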

Developer Activity / Concept Proliferation

Since “Developer Activity” was mentioned as a potential KPI, I’d like us to include something related to proliferation of foundational primitives seen in the wild.

The rationale is that we may not see all the projects that use our code (people may run private DHTs and manually peered swarms), but we should still see relative growth in mentions of content paths, URIs, CIDs, and content types.

What exactly do we track as “proliferation”? That’s a good question. Some ideas/examples:

If we add them all up, we will get some number. It is meaningless on its own, but by tracking it over time we could see the +/- trend month to month (a rough counting sketch follows below). The value here is that all these mentions go beyond people using our libraries or the public DHT.
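To make that less abstract, here is roughly what such counting could look like; the regexes are loose approximations of CID shapes, content paths, and ipfs/ipns URIs, not a proper parser, and the corpus to scan (scraped pages, package manifests, chat exports, …) is left open:

```typescript
// Rough "proliferation" counter: count CID-like strings, content paths, and ipfs/ipns URIs
// in a chunk of text. The regexes are approximations, not a full CID parser, and a path
// containing a CID is counted twice here -- fine for a trend line.
const CID_V0 = /\bQm[1-9A-HJ-NP-Za-km-z]{44}\b/g;   // CIDv0: "Qm" + 44 base58 chars
const CID_V1_BASE32 = /\bbaf[a-z2-7]{56,}\b/g;      // common base32 CIDv1 shape
const CONTENT_PATH = /\/ip[fn]s\/[^\s"')]+/g;       // /ipfs/... and /ipns/... paths
const CONTENT_URI = /\bip[fn]s:\/\/[^\s"')]+/g;     // ipfs:// and ipns:// URIs

function countMentions(text: string): number {
  return [CID_V0, CID_V1_BASE32, CONTENT_PATH, CONTENT_URI]
    .flatMap((re) => text.match(re) ?? [])
    .length;
}

// Feed a month's worth of collected text through countMentions() and plot the totals
// over time to see the month-to-month trend described above.
```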


Regarding latency to load sample websites:

Until now, we focused on the TTFB and domContentLoaded metrics. While working on our website monitoring infrastructure last week, I read up on how to measure website performance and came across this list:

To quote the website:

Performance metrics

There is no single metric or test that can be run on a site to evaluate how a user “feels”. However, there are a number of metrics that can be “helpful indicators”:

First paint
The time to start of first paint operation. Note that this change may not be visible; it can be a simple background color update or something even less noticeable.

First Contentful Paint (FCP)
The time until first significant rendering (e.g. of text, foreground or background image, canvas or SVG, etc.). Note that this content is not necessarily useful or meaningful.

First Meaningful Paint (FMP)
The time at which useful content is rendered to the screen.

Largest Contentful Paint (LCP)
The render time of the largest content element visible in the viewport.

Speed index
Measures the average time for pixels on the visible screen to be painted.

Time to interactive
Time until the UI is available for user interaction (i.e. the last long task of the load process finishes).

I think the relevant metrics on this list for us are First Contentful Paint, Largest Contentful Paint, and Time to interactive. First Meaningful Paint is deprecated (you can see that if you follow the link) and they recommend: “[…] consider using the LargestContentfulPaint API instead.”.

First paint would include changes that “may not be visible”, so I’m not particularly fond of this metric.

Speed index seems to be very much “website-specific”. By “website-specific” I mean that the IPFS network wouldn’t play a role in this metric; we would be measuring the performance of the website itself. I would argue that this is not something we want.

Besides the above metrics, we should still measure timeToFirstByte. Following the usual definition of Time to First Byte (TTFB), the metric would be the time difference between startTime and responseStart.

The Navigation Timing timeline also includes the two timestamps domContentLoadedEventStart and domContentLoadedEventEnd, so one might define the domContentLoaded metric as just the difference between the two. However, that only accounts for the processing time of the HTML (plus deferred JS scripts).

We could instead define domContentLoaded as the time difference between startTime and domContentLoadedEventEnd.
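In code, the two definitions above would look roughly like this via the Navigation Timing API (a sketch meant to run in the page context, e.g. injected through Puppeteer or Playwright; the tooling choice is just an assumption):

```typescript
// Derive the two metrics discussed above from the Navigation Timing API.
// Intended to run in the page context after the load event has fired.
const [nav] = performance.getEntriesByType("navigation") as PerformanceNavigationTiming[];

// Time to First Byte: startTime -> responseStart
const timeToFirstByte = nav.responseStart - nav.startTime;

// domContentLoaded as defined above: startTime -> domContentLoadedEventEnd
// (domContentLoadedEventEnd - domContentLoadedEventStart would only cover HTML processing)
const domContentLoaded = nav.domContentLoadedEventEnd - nav.startTime;

console.log({ timeToFirstByte, domContentLoaded });
```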


The current website measurement setup gathers the following data:

We could also include:

  • Time to interactive
  • domContentLoaded - as defined above

I believe we won’t be able to report all of the above metrics, so if I had to choose only two, I would pick timeToFirstByte and largestContentfulPaint (a small collection sketch for the latter follows below).
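For largestContentfulPaint, collection would look roughly like the sketch below, using a PerformanceObserver in the page context (the reporting trigger is illustrative; the final LCP candidate only settles once user input or page unload stops further updates):

```typescript
// Collect Largest Contentful Paint via a PerformanceObserver in the page context.
let largestContentfulPaint = 0;

const observer = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    // Later entries supersede earlier candidates; keep the most recent render time.
    largestContentfulPaint = entry.startTime;
  }
});
observer.observe({ type: "largest-contentful-paint", buffered: true });

// Report the last candidate when the page is backgrounded or unloaded (illustrative).
document.addEventListener("visibilitychange", () => {
  if (document.visibilityState === "hidden") {
    console.log({ largestContentfulPaint });
  }
});
```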

@K8hwS IIUC you refer to the PUT operation (aka Provide operation) to the DHT, right? I agree this is an important item to have on the list of KPIs and we’ve started including relevant results in our weekly reports as of this week - please see: network-measurements/README.md at master · protocol/network-measurements · GitHub.

The plots on the left-hand side of the “DHT performance” section refer to content publication (essentially the provide operation). We do not test with different file sizes, but there shouldn’t be any difference anyway, since IIUC we’re only publishing a provider record. Can you clarify if you mean something else?

Can you provide some clarification with regard to the number of cores, number of threads, and scalability? Do you mean the number of cores and threads needed when providing large volumes of content to the network?

Here’s an update on the set of KPIs we’re leaning towards, after discussing with several people and teams. Feel free to add or contribute further to the discussion.

Timeline

  • Decide on final KPIs on 24th March (1 week from today)
  • Start reporting the majority of those by the first week of April
  • Present and get further feedback at IPFS Thing 2023 (15th-19th April)
  • Revise and finalise in May.

KPIs we’re leaning towards

Network Size & Stability

  • Overall number of unique peers seen in the network (currently as seen by bootstrapper + preload nodes)
    • Number of unique DHT Server nodes
      • PeerIDs and unique IP addresses seen
    • Number of unique DHT Client nodes
    • Stability of DHT Server nodes
      • Classification:
        • Online: > 80% of time seen online
        • Mostly Online: 40%–80% of time seen online
        • Mostly Offline: 10%–40% of time seen online
        • Offline: < 10% of time seen online

Performance

  • DHT Latency - random content
    • Publication Latency: Time to PUT/Provide
    • Lookup Latency: Time to First Provider Record
  • Fetch Latency (see the measurement sketch after this list)
    • Time to First Byte (TTFB)
    • Time to Last Byte (TTLB)
  • e2e Latency:
    • Sum of DHT Latency + Fetch Latency
  • e2e Error Rate
    • Percentage of requests for which delivery of content failed
    • Report on the “leg” of the process where the failure occurred, e.g., provider record discovery failure vs. content fetch failure
  • Website Load Latency
    • Latency to load sample websites through a browser (using PL websites for now)
  • Some or all of the above need to be measured for both long-running and short-running nodes
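To illustrate the Fetch Latency items above, TTFB and TTLB for a single CID fetched over an HTTP gateway could be measured roughly as in the sketch below (the gateway URL is a placeholder and the CID is left to the caller; Node 18+ assumed):

```typescript
// Measure TTFB and TTLB for fetching one CID over an HTTP gateway.
const GATEWAY = "https://ipfs.io/ipfs/"; // placeholder; any gateway works

async function fetchLatency(cid: string): Promise<{ ttfb: number; ttlb: number }> {
  const start = performance.now();
  const res = await fetch(GATEWAY + cid);
  const reader = res.body!.getReader();

  const { done } = await reader.read();    // first chunk received -> Time to First Byte
  const ttfb = performance.now() - start;
  if (!done) {
    while (!(await reader.read()).done) { /* drain the rest of the body */ }
  }
  const ttlb = performance.now() - start;  // body fully received -> Time to Last Byte
  return { ttfb, ttlb };
}
```

Note that through a gateway the DHT leg is hidden inside the request, so the DHT Latency items above would still be measured separately from an instrumented node.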

Traffic

  • Number of requests to the public, PL-operated IPFS Gateways
  • Number of requests to specific vantage points we control through Bitswap and through the DHT.
  • Number of provider records published to the network
  • Number of unique CIDs seen through the Gateways

Abnormalities (could also be Health)

Examples include:

  • Increased number of rotating PeerIDs
  • Increased number of unresponsive nodes
  • Increased number of nodes from a particular geographic location

We should define alert levels for each one of these. For instance:

  • :green_circle: Good Health. No abnormalities.
  • :yellow_circle: Functioning Normally. E.g., an increased number of rotating PeerIDs.
  • :orange_circle: Concerning Signs / Investigation Needed. E.g., an increased number of nodes from a particular geographic location.
  • :red_circle: Red Alert. E.g., an increased number of unresponsive nodes; performance disrupted.

Developer Activity

[More to be added by @lidel and @guseggert]

  • GitHub activity in IPFS-related GitHub orgs
  • GitHub activity in ipfs/specs

IMPORTANT NOTE

These are the high-level KPIs, primarily targeting one component of the IPFS system, i.e., the DHT. If you’re interested in lower-level metrics for your application or implementation project, please bring them up. We do gather and will be reporting lots of lower-level, protocol-specific metrics. The current set of metrics we’re looking at can be found in our weekly reports: network-measurements/reports/2023 at master · protocol/network-measurements · GitHub.
Furthermore, we’ll be developing “project-wide KPIs” to get a broader view of the network - this will come as a separate post later in the year.

Hi, I think I may be able to contribute to this project. I remember listening to a talk in Lisbon and got alerted again yesterday while watching the implementers sync on YouTube.

My applicable web2 skill is building dashboards with d3.js. It’s a very versatile way of designing and building graphs, visualising data by binding it to two-dimensional shape attributes. Have a look:

https://www.schadedoormijnbouw.nl/dashboard
https://www.schadedoormijnbouw.nl/dashboard?topic=fysieke_schade
https://www.schadedoormijnbouw.nl/dashboard?topic=voortgang

My web3 interests are p2p networks and decentralised DBs. So this project feels like a good fit.

Would you mind if I had a go at building a dashboard with these data?

If so, what are your plans to publish these data besides on GitHub? Are crawler results stored somewhere? Could I try to put them in a decentralised/distributed DB myself? Or an IPLD DAG? I am happy to collaborate or experiment myself.

Looking forward, Joera

joera@joeramulders.com

Help is definitely welcome - thanks for offering! :raised_hands:

We do have some of our dashboarding plans in place (mostly using plotly), but would be happy to discuss and see if we can find common interests. Will reach out separately.

As per my earlier message (IPFS Metrics & KPIs - #9 by yiannis), here’s an update to the timeline reported there.

Timeline

It is worth noting that these are the high-level KPIs, primarily targeting one component of the IPFS system, i.e., the DHT. We’ll be developing “project-wide KPIs” to get a broader view of the network - this will come as a separate post and call for feedback later in the year.