Content analytics on IPFS

Publishers want to track the success of the content they publish regardless of distribution method (HTTP/IPFS/other). In print, you track how many physical copies you move; on the HTTP web, you track GET requests to a hit counter… but what is the equivalent for IPFS?


Thanks for asking this. It’s an important topic. Here are some of the observations I’ve been able to gather:

Counting GET Requests is a bad metric. People don’t use it.

GET requests on an HTTP endpoint are a bad metric for many reasons. Partly it's because they are a noisy signal (how do you differentiate a DDoS from content being popular?), and partly because they measure network activity rather than engagement. As a result, anybody who needs good-quality signals about engagement and usage can't rely on GET requests as a meaningful metric.

Contemporary metrics (page loads, clicks) can still be used

The metrics that people actually rely on (i.e. Google Analytics) measure page loads, clicks, etc. This is predominantly achieved using JavaScript, embedded in the content you're delivering, that runs in the browser when the content is viewed. That embedded code reports information from the browser to a monitoring service. For better or worse, those strategies are agnostic to how the content was delivered to the end user, meaning you can use them with p2p delivery modalities as well as centralized ones. This raises some design questions around content security that require serious consideration. If those designs evolve well, we will see a net increase in end users' protection against malware and spyware while still preserving healthy channels for content publishers to get clear signals from their users about the relative value of the content they have provided.
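
To make that concrete, here's a minimal sketch of what such an embedded beacon could look like. The collector endpoint is hypothetical, and it uses only standard browser APIs, so it behaves the same whether the page arrived over HTTP, an IPFS gateway, or a native IPFS handler:

```typescript
// Minimal sketch of a delivery-agnostic page-view beacon.
// https://analytics.example.com/collect is a hypothetical collector endpoint.
interface ViewEvent {
  cid?: string;      // content identifier, if the publisher embeds it in the page
  url: string;       // the location the browser actually loaded
  referrer: string;
  loadedAt: string;  // ISO timestamp
}

function reportView(cid?: string): void {
  const event: ViewEvent = {
    cid,
    url: window.location.href,
    referrer: document.referrer,
    loadedAt: new Date().toISOString(),
  };
  // sendBeacon queues the report without blocking the page.
  navigator.sendBeacon(
    'https://analytics.example.com/collect', // hypothetical endpoint
    JSON.stringify(event),
  );
}

window.addEventListener('load', () => reportView('bafy...')); // CID placeholder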

P2P and Content-Addressing Provide New Opportunities for Acquiring Better Signals from Users

Unlike the location-addressed web, p2p approaches give content publishers new opportunities for meaningful metrics. For example,

  • if clusters of nodes add content to their pin sets (e.g. a research library pins the content on its cluster), they can report this back to the publisher, providing a signal that the operators of that cluster value the content enough to store a copy (a rough sketch of such reporting follows this list)
  • when people link to the content, they use hash identifiers. This opens up many possibilities for aggregating, analyzing and reporting on the network of linked content – giving you impact factors rather than just page-load and click counts
  • any metadata created by other parties, either explicitly or implicitly, if published on the p2p network, can be used to augment discovery and/or provide clearer information about who, when, where, why and how the content is being used
  • also, of course, at any given point you can query the network to see how many public nodes are providing the content and where those nodes are in the world
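
To illustrate the first bullet, here's a rough sketch of what that pin-set reporting could look like, assuming a local go-ipfs/Kubo node exposing its HTTP RPC API on port 5001. The publisher endpoint and the list of CIDs to check are hypothetical:

```typescript
// Rough sketch: a cluster/library node reports which of a publisher's CIDs it
// has pinned. Assumes a local node with its RPC API on 127.0.0.1:5001.
const KUBO_RPC = 'http://127.0.0.1:5001/api/v0';
const PUBLISHER_ENDPOINT = 'https://publisher.example.com/pin-reports'; // hypothetical

async function reportPins(publisherCids: string[]): Promise<void> {
  // The RPC API is POST-only; pin/ls lists everything pinned locally as
  // {"Keys": {"<cid>": {"Type": "recursive"}, ...}}.
  const res = await fetch(`${KUBO_RPC}/pin/ls?type=recursive`, { method: 'POST' });
  const { Keys } = (await res.json()) as { Keys: Record<string, { Type: string }> };

  // Assumes the publisher's CIDs use the same CID version/encoding as the pins.
  const pinnedHere = publisherCids.filter((cid) => cid in Keys);

  // Tell the publisher which of its CIDs this operator chose to keep.
  await fetch(PUBLISHER_ENDPOINT, {
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify({ pinned: pinnedHere, reportedAt: new Date().toISOString() }),
  });
}

reportPins(['bafy...']) // placeholder CID(s) from the publisher's catalogue
  .catch(console.error);
```

The point is simply that the node operator opts in to reporting; the publisher never has to instrument the content itself.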

That last point was the first thing that came to mind: `ipfs dht findprovs <hash>`. However, to my knowledge there is no way to distinguish between nodes where the content is actually pinned and nodes that merely cached it because their owner accessed it once or so. But I'm not sure whether ipfs should provide more fine-grained information like that, or whether that's even possible.
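
For what it's worth, the same query is easy to drive programmatically against a local node's RPC API and reduce to a provider count. The event format parsed below is an assumption based on how go-ipfs streams DHT query events and may differ between versions:

```typescript
// Sketch: count distinct providers for a CID by calling the same DHT query the
// CLI uses (`ipfs dht findprovs <hash>`), via a local node's RPC API on port
// 5001. The API streams newline-delimited JSON query events; events with
// Type 4 carry provider records (assumed go-ipfs behaviour).
async function countProviders(cid: string): Promise<number> {
  const res = await fetch(
    `http://127.0.0.1:5001/api/v0/dht/findprovs?arg=${cid}`,
    { method: 'POST' },
  );
  const body = await res.text();

  const providers = new Set<string>();
  for (const line of body.split('\n')) {
    if (!line.trim()) continue;
    const event = JSON.parse(line);
    if (event.Type === 4) {                 // 4 = Provider event
      for (const peer of event.Responses ?? []) {
        providers.add(peer.ID);
      }
    }
  }
  return providers.size;
}

countProviders('bafy...').then((n) => console.log(`${n} public providers found`));
```

As you say, though, this counts caching nodes and pinning nodes alike.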


You could check how long they keep it, and whether it's a full copy. If they only have, say, 99% of the blocks, it's probably just cached.
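
A crude way to approximate that "full copy" check from the outside, assuming a local Kubo node and that the remote peer actually announces provider records for every block it holds (which is not guaranteed), could look like this:

```typescript
// Very rough heuristic: enumerate every block in the DAG with refs -r, then
// check whether a given peer advertises provider records for all of them.
// A peer with only a partial cached copy will be missing some blocks.
// Slow, noisy, and only works to the extent that peers announce all blocks.
async function rpc(path: string): Promise<string> {
  const res = await fetch(`http://127.0.0.1:5001/api/v0/${path}`, { method: 'POST' });
  return res.text();
}

async function peerProvidesBlock(peerId: string, cid: string): Promise<boolean> {
  const events = (await rpc(`dht/findprovs?arg=${cid}`)).split('\n');
  return events.some((line) => line.includes(peerId)); // crude string match
}

async function peerHasFullCopy(peerId: string, rootCid: string): Promise<boolean> {
  // refs -r streams one JSON object per linked block: {"Ref":"<cid>","Err":""}
  const refs = (await rpc(`refs?arg=${rootCid}&recursive=true`))
    .split('\n')
    .filter((l) => l.trim())
    .map((l) => JSON.parse(l).Ref as string);

  for (const cid of [rootCid, ...refs]) {
    if (!(await peerProvidesBlock(peerId, cid))) return false;
  }
  return true;
}

peerHasFullCopy('12D3Koo...', 'bafy...').then(console.log); // placeholder peer ID and CID
```

It won't be reliable, but combined with re-checking over time it gives a rough pinned-vs-cached signal.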

I have been thinking about this topic a lot too. It seems like you have a lot of options, especially if the content is browser/web friendly, since you can leverage current analytics tech as well as the node-querying ability. If it is non-web, however, I suppose you could always embed a beacon in the content itself (and try to make it vital to operation). It definitely varies by use case.