From @denisnazarov on Fri Apr 24 2015 16:48:32 GMT+0000 (UTC)
The following is mostly paraphrased from the paper cited below:
Content identification in P2P networks has until now been achieved by using metadata or cryptographic hashes. However, with the increasing number of duplicates under different names and in different formats, especially in (unmanaged) P2P networks, these tools have become insufficient for finding content reliably. This is especially a problem for digital images, which end up in various formats and compression levels as they propagate across the web. See https://github.com/mine-code/canonical-content-registry for a thorough description of the issue.
A possible approach is to identify content in P2P networks by using perceptual hashes (or fingerprints), which are extracted from perceptual features of the content and are robust to typical processing. The uniform distribution of the extracted fingerprints makes it possible to reuse existing DHT-based keyword search mechanisms for fingerprint queries. In practice, this would allow querying the DHT for canonical metadata related to an image by using the perceptual features of any instance of that image.
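To make the difference between the two kinds of hashes concrete, here is a minimal sketch in Python. It uses a simple "average hash" over a grayscale pixel grid as a stand-in perceptual fingerprint; this is an assumption chosen for brevity, not the fingerprint used in the paper, and the names `average_hash`, `hamming` and `sha256_of` are made up for this example.

```python
import hashlib

def average_hash(pixels, size=8):
    """Toy perceptual hash: one bit per block, set if the block is brighter
    than the mean of all blocks. `pixels` is a 2D list of grayscale ints
    whose height and width are multiples of `size` (an assumption)."""
    bh, bw = len(pixels) // size, len(pixels[0]) // size
    blocks = []
    for by in range(size):
        for bx in range(size):
            total = sum(pixels[y][x]
                        for y in range(by * bh, (by + 1) * bh)
                        for x in range(bx * bw, (bx + 1) * bw))
            blocks.append(total / (bh * bw))
    mean = sum(blocks) / len(blocks)
    bits = 0
    for value in blocks:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

def sha256_of(pixels):
    """Cryptographic hash of the raw pixel bytes, for comparison."""
    return hashlib.sha256(bytes(v for row in pixels for v in row)).hexdigest()

# Synthetic 16x16 gradient standing in for an image, plus two variants.
original = [[x * 8 + y * 4 for x in range(16)] for y in range(16)]
brighter = [[v + 10 for v in row] for row in original]   # mild processing
inverted = [[255 - v for v in row] for row in original]  # different content

print(hamming(average_hash(original), average_hash(brighter)))  # small distance
print(hamming(average_hash(original), average_hash(inverted)))  # large distance
print(sha256_of(original) == sha256_of(brighter))               # digests differ
```

The cryptographic digest changes completely after a uniform brightness shift, while the toy fingerprint stays at a small Hamming distance from the original. A real system would need a fingerprint that also survives resizing, recompression and cropping.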
Such a system has many implications for persistent metadata for digital media. Most importantly, it enables persistent attribution. Currently, attribution for digital media is explicit and is easily lost as content goes viral on the internet. This is a major disservice to content creators, because they cannot be discovered and so cannot reap the benefits of their content's virality. It forces them to rely for identity on centralized distribution platforms such as YouTube, Twitter, Tumblr, Instagram, etc., which monetize all the content flowing through their pipes with no regard for attribution. Persistent metadata can also enable much more effective aggregation and discovery of knowledge related to digital media.
The paper "Content Based Video Identification in Peer-to-Peer Networks: Requirements and a Novel Solution" by Koz and Lagendijk proposes a solution using a DHT.
They identify the following:
- file names and cryptographic hashes are not sufficient to identify multimedia files that circulate under different names and in different formats in current P2P systems
- DHT systems are the current state of the art in P2P search mechanisms, with the advantages of distributed traffic and storage, scalability, and guaranteed search
Their proposed distributed system includes the following:
- A virtual fingerprint space representing the fingerprint vectors is constructed, and this space is partitioned among all the peers in the network
- Fingerprints of a shared image at a peer are automatically extracted by the client program at that peer
- Extracted fingerprints are mapped to the fingerprint space, and indexing information about each fingerprint is stored at the peer responsible for the mapped position
- Fingerprint queries are collaboratively routed by the peers
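The steps above can be sketched as a toy, single-process model in Python. This is an illustration under stated assumptions, not the paper's design: the class `ToyFingerprintDHT`, the equal-range partitioning, and the one-bit-flip probing used to find near-duplicates are all hypothetical choices, and real peer-to-peer routing is collapsed into a local lookup.

```python
import bisect

class ToyFingerprintDHT:
    """Toy model: an integer fingerprint space split into contiguous,
    equally sized ranges, one per peer (hypothetical partitioning)."""

    def __init__(self, num_peers, bits=64):
        self.bits = bits
        space = 1 << bits
        # Peer i is responsible for fingerprints in [starts[i], starts[i + 1]).
        self.starts = [i * space // num_peers for i in range(num_peers)]
        # Each peer's local index: fingerprint -> list of records.
        self.stores = [{} for _ in range(num_peers)]

    def peer_for(self, fingerprint):
        """Index of the peer whose range contains this fingerprint."""
        return bisect.bisect_right(self.starts, fingerprint) - 1

    def publish(self, fingerprint, record):
        """Store indexing information at the responsible peer."""
        store = self.stores[self.peer_for(fingerprint)]
        store.setdefault(fingerprint, []).append(record)

    def query(self, fingerprint):
        """Look up the exact fingerprint and every fingerprint one bit away.
        Each probe is sent to whichever peer owns that probe's position."""
        probes = [fingerprint]
        probes += [fingerprint ^ (1 << i) for i in range(self.bits)]
        results = []
        for probe in probes:
            results.extend(self.stores[self.peer_for(probe)].get(probe, []))
        return results

# Usage with a small 16-bit space and 4 peers; the record is a placeholder.
dht = ToyFingerprintDHT(num_peers=4, bits=16)
dht.publish(0xBEEF, {"metadata": "<pointer-to-canonical-metadata>"})
print(dht.query(0xBEEF ^ 0b1))   # one bit off: the record is still found
print(dht.query(0xBEEF ^ 0b11))  # two bits off: nothing is returned
```

Probing by bit flips gets expensive quickly as the tolerated distance grows, so a real query layer would need a mapping that keeps perceptually similar fingerprints close together in the key space.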
I am interested in implementing this specifically for images, and I wonder how an approach like the above could be implemented as a query layer in IPFS that points to metadata, makes it possible to retrieve image instances, etc.
@jbenet @jessewalden @moudy @muneeb-ali @shea256
Related:
https://github.com/ipfs/blockchain-data/issues/1
Copied from original issue: https://github.com/ipfs/faq/issues/15