Perceptual content identification in IPFS

From @denisnazarov on Fri Apr 24 2015 16:48:32 GMT+0000 (UTC)

The following is mostly paraphrased from the paper cited below:

Content identification in P2P networks has until now been achieved by using metadata or cryptographic hashes. However, as duplicates proliferate under different names and in different formats, especially in (unmanaged) P2P networks, these tools have become insufficient for reliable content identification. This is especially a problem for digital images, which exist in various formats and compression levels as they propagate across the web. See https://github.com/mine-code/canonical-content-registry for a thorough description of the issue.

A possible approach is to identify content in P2P networks using perceptual hashes (or fingerprints) extracted from perceptual features of the content that are robust to typical processing. The uniform distribution of the extracted fingerprints enables the use of existing DHT-based keyword search mechanisms for fingerprint queries. In practice, this would allow querying the DHT for canonical metadata related to an image by using perceptual features of an image instance.
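
To make this concrete, here is a minimal sketch (not from the paper cited below) of extracting perceptual hashes from two encodings of the same image and comparing them. It assumes the third-party Pillow and imagehash Python libraries, and the file names are placeholders.

```python
from PIL import Image
import imagehash

# Perceptual hashes of two encodings of the same picture stay close, whereas
# cryptographic hashes of the two files would differ completely.
original = imagehash.phash(Image.open("photo.png"))
recompressed = imagehash.phash(Image.open("photo_small.jpg"))  # same picture, resized and re-encoded

print(str(original))            # hex digest, usable as a lookup key (e.g. in a DHT)
print(original - recompressed)  # Hamming distance; small for perceptually identical images
```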

Such a system has many implications for persistent metadata for digital media; most importantly, it enables persistent attribution. Currently, attribution for digital media is explicit and is easily lost as content goes viral on the internet. This is a major disservice to content creators, because they cannot be discovered and therefore cannot reap the benefits of their virality. It forces them to rely on centralized distribution platforms such as YouTube, Twitter, Tumblr, and Instagram for identity, platforms that monetize all the content flowing through their pipes with no regard for attribution. Persistent metadata can also enable much more effective aggregation and discovery of knowledge related to digital media.

The paper Content Based Video Identification in Peer-to-Peer Networks: Requirements and a Novel Solution by Koz and Lagendijk proposes a solution using a DHT.

They identify the following:

  • File names and cryptographic hashes are not sufficient to identify multimedia files that circulate under different names and in different formats in current P2P systems.
  • DHT systems are the current state of the art in P2P search mechanisms, with the advantages of distributed traffic and storage, scalability, and guaranteed search.

Their proposed distributed system includes the following (a rough code sketch follows the list):

  • A virtual fingerprint space representing the fingerprint vectors is constructed, and this space is partitioned among all the peers in the network.
  • Fingerprints of a shared image at a peer are automatically extracted by the client program at that peer.
  • Extracted fingerprints are mapped to the fingerprint space, and the indexing information about each fingerprint is stored at the peer responsible for the mapped position.
  • Fingerprint queries are collaboratively routed by the peers.

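As a rough, single-process illustration of that scheme (not the paper's exact construction): each peer owns a partition of a shared key space, and the indexing record for a fingerprint is stored at the peer that owns the fingerprint's mapped position. The class name, peer names, and record fields below are made up for this sketch; a real deployment would route lookups over a networked DHT such as Kademlia.

```python
import bisect
import hashlib

class ToyFingerprintDHT:
    """Single-process model: the fingerprint key space is partitioned among peers
    (consistent hashing), and each fingerprint's indexing record lives at the peer
    owning its mapped position. A real DHT routes these lookups over the network."""

    def __init__(self, peer_ids):
        # Each peer is placed on a ring at hash(peer_id) and owns the range ending there.
        self.ring = sorted((self._key(p), p) for p in peer_ids)
        self.store = {p: {} for p in peer_ids}

    @staticmethod
    def _key(value: str) -> int:
        # The paper relies on fingerprints being uniformly distributed; hashing here
        # just maps both peer ids and fingerprints into the same key space.
        return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

    def _owner(self, key: int) -> str:
        points = [k for k, _ in self.ring]
        i = bisect.bisect_left(points, key) % len(self.ring)
        return self.ring[i][1]

    def put(self, fingerprint: str, record: dict) -> str:
        """Store an indexing record for a fingerprint; returns the responsible peer."""
        peer = self._owner(self._key(fingerprint))
        self.store[peer].setdefault(fingerprint, []).append(record)
        return peer

    def get(self, fingerprint: str) -> list:
        """Query the peer responsible for this fingerprint's mapped position."""
        peer = self._owner(self._key(fingerprint))
        return self.store[peer].get(fingerprint, [])

dht = ToyFingerprintDHT(["peer-a", "peer-b", "peer-c"])
dht.put("c3a1f0e2d4b6a897", {"title": "Sunset over the bay", "author": "alice"})
print(dht.get("c3a1f0e2d4b6a897"))
```
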
I am interested in implementing this specifically for images, and I wonder how an approach like the above could be implemented as a query layer on top of IPFS that points to metadata, provides the ability to retrieve image instances, etc.

@jbenet @jessewalden @moudy @muneeb-ali @shea256

Related:

https://github.com/ipfs/blockchain-data/issues/1

Copied from original issue: https://github.com/ipfs/faq/issues/15

From @jbenet on Fri Apr 24 2015 21:55:44 GMT+0000 (UTC)

@denisnazarov will respond more later this weekend, but absolutely. solid ideas. We already have rabin fingerprinting incorporated into file chunking, and will be looking at other fingerprinting techniques too. There are lots of good ones that are content-dependent. In terms of resolving via the DHT, yeah i remember seeing a couple of papers on this. none seemed conclusively better, since most diffs are actually tracked by versioning data structures anyway (git, OTs, etc). but i'll read the paper you mentioned as soon as i can, and will try to post the others i had found here. i should say that the IPFS-DHT is built to evolve, so it's very possible we can incorporate this kind of thing into it.
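
For readers unfamiliar with how a rolling fingerprint drives chunking, here is a simplified sketch of content-defined chunking using a Rabin-Karp-style polynomial rolling hash. It is not a true Rabin fingerprint over GF(2) and not the actual IPFS chunker; the function name, window size, modulus, and cut mask are toy choices for illustration.

```python
def content_defined_chunks(data: bytes, window=48, mask=(1 << 13) - 1,
                           min_size=2048, prime=257, mod=(1 << 61) - 1):
    """Yield chunks whose boundaries depend on content, not on fixed offsets:
    cut whenever the rolling hash of the last `window` bytes has its low bits
    all zero. Identical regions then chunk identically even after insertions."""
    pow_w = pow(prime, window - 1, mod)  # weight of the byte leaving the window
    h, start = 0, 0
    for i, b in enumerate(data):
        if i >= window:
            h = (h - data[i - window] * pow_w) % mod  # drop the oldest byte
        h = (h * prime + b) % mod                     # add the newest byte
        if i - start + 1 >= min_size and (h & mask) == 0:
            yield data[start:i + 1]
            start = i + 1
    if start < len(data):
        yield data[start:]

# Example: after a small insertion, chunks past the edited region are unchanged.
import os
blob = os.urandom(64 * 1024)
edited = blob[:1000] + b"inserted" + blob[1000:]
a, b = list(content_defined_chunks(blob)), list(content_defined_chunks(edited))
print(len(set(a) & set(b)), "of", len(a), "chunks unchanged after the edit")
```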

From @jbenet on Fri Apr 24 2015 21:57:19 GMT+0000 (UTC)

(also, for whoever has to pronounce “rabin fingerprinting”, it’s RAH-bin. not ra-BEEN. i had this mixed up and was recently corrected 2deg from rabin himself).

From @parkan on Fri Dec 11 2015 19:15:07 GMT+0000 (UTC)

@jbenet here’s a paper describing the use of an LSH scheme (Random Hyperplane Hashing) for efficient similar-content retrieval. The approach we’re thinking of is to use a domain-specific perceptual analysis (pHash for images, MFCC for audio, etc.), then use RHH to transform it into the Hamming distance space as per the paper.
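
A minimal sketch of the random-hyperplane step, assuming the domain-specific analysis (pHash, MFCC, etc.) has already produced a real-valued feature vector. It uses only numpy; the function names, bit width, and seed are made-up parameters, and every peer would need to share the same hyperplanes (here fixed by a shared seed) for the resulting codes to be comparable across the network.

```python
import numpy as np

def rhh_code(features: np.ndarray, n_bits: int = 64, seed: int = 42) -> np.ndarray:
    """Random hyperplane hashing: project the feature vector onto n_bits random
    hyperplanes and keep only the sign. The Hamming distance between two codes
    approximates the angle between the original feature vectors."""
    rng = np.random.default_rng(seed)               # shared seed -> shared hyperplanes
    planes = rng.standard_normal((n_bits, features.shape[-1]))
    return (planes @ features > 0).astype(np.uint8)

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    return int(np.count_nonzero(a != b))

# Toy demo: a feature vector, a slightly perturbed copy (e.g. the same audio
# re-encoded), and an unrelated vector.
rng = np.random.default_rng(0)
x = rng.standard_normal(128)
x_noisy = x + 0.05 * rng.standard_normal(128)
y = rng.standard_normal(128)

print(hamming(rhh_code(x), rhh_code(x_noisy)))  # small
print(hamming(rhh_code(x), rhh_code(y)))        # around n_bits / 2
```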