Perceptual content identification in IPFS

From @denisnazarov on Fri Apr 24 2015 16:48:32 GMT+0000 (UTC)

The following is mostly paraphrased from the paper cited below:

Content identification in P2P networks has until now relied on metadata or cryptographic hashes. However, with the growing number of duplicates under different names and in different formats, especially in (unmanaged) P2P networks, these tools have become insufficient for finding content reliably. This is especially a problem for digital images, which exist in various formats and compression levels as they propagate across the web. See for a thorough identification of the issue.

A possible approach is to identify content in P2P networks by using perceptual hashes (or fingerprints) extracted from perceptual features of the content that are robust to typical processing. The uniform distribution of the extracted fingerprints enables the use of existing DHT-based keyword search mechanisms for fingerprint queries. In practice, this would allow querying the DHT for canonical metadata related to an image by using the perceptual features of an image instance.
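To make the contrast with cryptographic hashes concrete, here is a minimal sketch. It uses a simple "average hash" as a stand-in perceptual fingerprint, which is an assumption for illustration and not the fingerprint used in the paper. The 8x8 synthetic image and the `average_hash` and `hamming` helpers are hypothetical names.

```python
import hashlib

def average_hash(pixels):
    """Return a bit string (as an int) from a small grayscale image.

    pixels: list of rows of ints in 0..255, assumed already downscaled.
    Each bit records whether a pixel is brighter than the image mean.
    """
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

def sha256_of(pixels):
    """Cryptographic hash of the raw pixel bytes, for comparison."""
    return hashlib.sha256(bytes(p for row in pixels for p in row)).hexdigest()

# Synthetic 8x8 gradient and a uniformly brightened copy of it.
original = [[16 * x + 8 * y for x in range(8)] for y in range(8)]
brightened = [[p + 20 for p in row] for row in original]

print("perceptual distance:", hamming(average_hash(original), average_hash(brightened)))
print("sha256 equal:", sha256_of(original) == sha256_of(brightened))
```

The brightened copy keeps the same perceptual fingerprint because every pixel and the mean shift by the same amount, while its SHA-256 digest no longer matches. Real perceptual hashes tolerate a much wider range of processing (rescaling, recompression) than this toy example does.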

Such a system has many implications for persistent metadata for digital media. Most importantly, it enables persistent attribution. Currently, attribution for digital media is explicit and is easily lost as content goes viral on the internet. This is a major disservice to content creators, because they cannot be discovered and so cannot reap the benefits of their virality. It forces them to rely for identity on centralized distribution platforms such as YouTube, Twitter, Tumblr, Instagram, etc., which monetize all the content flowing through their pipes with no regard for attribution. Persistent metadata can also enable much more effective aggregation and discovery of knowledge related to digital media.

The paper "Content Based Video Identification in Peer-to-Peer Networks: Requirements and a Novel Solution" by Koz and Lagendijk proposes a solution using a DHT.

They identify the following:

  • file names and cryptographic hashes are not sufficient to identify multimedia files that exist under different names and in different formats in current P2P systems
  • DHT systems are the current state of the art in P2P search mechanisms, with the advantages of distributed traffic and storage, scalability, and guaranteed search

Their proposed distributed system includes the following:

  • A virtual fingerprint space representing the fingerprint vectors is constructed and this space is partitioned by all the peers in the network
  • Fingerprints of a shared image at a peer are automatically extracted by the client program at that peer
  • Extracted fingerprints are mapped to the fingerprint space, and indexing information about each fingerprint is stored at the peer whose zone contains the mapped position
  • Fingerprint queries are collaboratively routed by the peers
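
The four steps above can be sketched roughly as follows. This is a simplified illustration, not the paper's design: it assumes a one-dimensional fingerprint space split into equal contiguous zones, and `FingerprintSpace`, `publish`, and `query` are hypothetical names. In a real network each peer would hold only its own zone and queries would be forwarded hop by hop.

```python
import bisect

class FingerprintSpace:
    """Toy fingerprint space partitioned among a fixed set of peers."""

    def __init__(self, peer_ids, bits=16):
        self.size = 1 << bits
        self.peers = list(peer_ids)
        n = len(self.peers)
        # Exclusive upper bound of the zone owned by each peer.
        self.bounds = [(i + 1) * self.size // n for i in range(n)]
        # Per-peer index: fingerprint -> list of indexing records.
        self.index = {peer: {} for peer in self.peers}

    def owner(self, fingerprint):
        """Return the peer whose zone contains this fingerprint."""
        if not 0 <= fingerprint < self.size:
            raise ValueError("fingerprint outside the fingerprint space")
        return self.peers[bisect.bisect_right(self.bounds, fingerprint)]

    def publish(self, fingerprint, record):
        """Store indexing information at the peer owning the mapped position."""
        zone = self.index[self.owner(fingerprint)]
        zone.setdefault(fingerprint, []).append(record)

    def query(self, fingerprint):
        """Route a query to the owning peer and return its records."""
        return list(self.index[self.owner(fingerprint)].get(fingerprint, []))

space = FingerprintSpace(["peer-A", "peer-B", "peer-C", "peer-D"])
space.publish(0x1234, {"title": "example image", "holder": "peer-C"})
print(space.owner(0x1234), space.query(0x1234))
```

This sketch only answers exact-match queries. A near-duplicate image would produce a nearby but different fingerprint, so a practical system would also need to search neighboring positions in the space.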

I am interested in implementing this specifically for images, and wonder how an approach like the above could be implemented as a query layer in IPFS that points to metadata, provides the ability to retrieve image instances, etc.

@jbenet @jessewalden @moudy @muneeb-ali @shea256


Copied from original issue:

From @jbenet on Fri Apr 24 2015 21:55:44 GMT+0000 (UTC)

@denisnazarov will respond more later this weekend, but absolutely. solid ideas. We already have rabin fingerprinting incorporated into file chunking, and will be looking at other fingerprinting techniques too. There’s lots of good ones that are content-dependent. In terms of resolving via the DHT, yeah i remember seeing a couple of papers on this. none seemed conclusively better since most diffs are actually tracked by versioning data structures anyway (git, OTs, etc). but i’ll read the paper you mentioned as soon as i can, and will try to post the others i had found here. i should say that the IPFS-DHT is built to evolve, so it’s very possible we can incorporate this kind of thing into it.

From @jbenet on Fri Apr 24 2015 21:57:19 GMT+0000 (UTC)

(also, for whoever has to pronounce “rabin fingerprinting”, it’s RAH-bin. not ra-BEEN. i had this mixed up and was recently corrected 2deg from rabin himself).

From @parkan on Fri Dec 11 2015 19:15:07 GMT+0000 (UTC)

@jbenet here’s a paper describing using an LSH (Random Hyperplane Hash) for efficient similar content retrieval. The approach we’re thinking of is to use a domain-specific perceptual analysis (pHash for images, MFCC for audio, etc) then use RHH to transform it into the hamming distance space as per the paper.
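
As a rough illustration of the RHH step, here is a minimal sketch of random hyperplane hashing in general, not the exact scheme from the linked paper. The feature vector stands in for the output of a perceptual analysis stage; its values, the dimension, the bit count, and the names `make_hyperplanes` and `rhh` are all assumptions for the example.

```python
import random

def make_hyperplanes(dim, bits, seed=0):
    """Draw `bits` random Gaussian normal vectors of length `dim`.

    All peers would need to share the same hyperplanes (here: same seed).
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]

def rhh(vector, hyperplanes):
    """Map a real-valued feature vector to a bit string, one bit per hyperplane.

    Each bit is the side of the hyperplane the vector falls on, so vectors
    separated by a small angle tend to agree on most bits.
    """
    code = 0
    for plane in hyperplanes:
        dot = sum(v * w for v, w in zip(vector, plane))
        code = (code << 1) | (1 if dot >= 0 else 0)
    return code

def hamming(a, b):
    """Number of differing bits between two codes."""
    return bin(a ^ b).count("1")

planes = make_hyperplanes(dim=8, bits=32)
features = [0.9, 0.1, 0.4, 0.7, 0.2, 0.5, 0.3, 0.8]  # hypothetical feature vector
rescaled = [2.0 * x for x in features]                # same direction
opposite = [-x for x in features]                     # opposite direction

print(hamming(rhh(features, planes), rhh(rescaled, planes)))
print(hamming(rhh(features, planes), rhh(opposite, planes)))
```

Rescaling a vector leaves every bit unchanged, and negating it flips every bit, so Hamming distance between codes tracks the angle between the original feature vectors. Vectors in between land at intermediate distances in expectation.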