Programmatically set CIDs

aprasad · September 19, 2022, 3:41pm

Hey! I’m reviewing this grant proposal, “Legal Text Repository and KNN based CID search engine” (Issue #923, filecoin-project/devgrants on GitHub), which proposes to set CIDs based on the content of the files for easy vectorised search and retrieval. At first glance, my understanding is that there is no way to do this, given that the CID is just the prefix and hash. However, it seems there’s a possibility of using IPLD to associate custom metadata with IPFS content (similar to the LikeCoin use case: https://docs.ipfs.tech/concepts/case-study-likecoin/#the-story). Could @Discordian or @danieln chime in on rough feasibility?

Thanks!

SionoiS · September 20, 2022, 11:32am

My $0.02, if I may.

CIDs are not meant to contain that information. The best approach would be to use IPLD to build indexing on top.

There must be clever ways to design indexes to search for information in a way that fits IPLD/IPFS.

zacharywhitley · September 20, 2022, 2:34pm

How large is the RECAP archive?

It looks like it’s about 10 GB. There are two largely separate problems here: distributing the archive and searching the archive. Distributing the archive is a good fit for Filecoin. Searching? Not so much.

As you suggested, there’s a possibility that you could add a new similarity-based hash to IPFS. It would still be questionable how well you could search based on that, and it would be a lot of work. You’d probably be better off automating getting it into a database like Elastic.

aprasad · September 20, 2022, 5:09pm

Thanks! @SionoiS that’s what I thought, just wanted to get a second opinion on using IPLD. @zacharywhitley I assume searching the archive would have to happen in IPFS. AFAIK, there isn’t any built-out solution for searching Filecoin CIDs based on metadata (but I may be wrong here).

Discordian · September 20, 2022, 5:15pm

Funny, I was going to ask you. I really like this idea, but I’m not sure it is feasible. If it is remotely feasible, some way of “tagging” CIDs in an embedded way sounds super awesome…

I don’t know enough about Filecoin’s CID method, and whether it allows enough space to represent the vectors

I’m not sure what the max length of a CID can be; I believe that’s tied to the longest multihash length. If we look at the spec (multiformats/cid), we see this:

<cidv1> ::= <multibase-prefix><multicodec-cidv1><multicodec-content-type><multihash-content-address>

So I don’t see anywhere to store this info (outside of the hash itself), if we’re talking about storing it directly in a CID. You have the multibase prefix, version, multicodec type, and then the multihash itself. So if we’re baking more info in, I believe we’d need something like a cidv2 (something I just made up).
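To make the grammar above concrete, here is a minimal sketch in Python (standard library only) that builds a CIDv1 for some bytes and then splits it back into its parts. It assumes the base32 multibase prefix (`b`), the `raw` content codec (`0x55`), and sha2-256 (`0x12`); the helper names `make_raw_cidv1` and `parse_cidv1` are made up for illustration and are not part of any IPFS library.

```python
import base64
import hashlib


def read_varint(data: bytes, pos: int) -> tuple[int, int]:
    """Decode one unsigned varint (LEB128); return (value, next position)."""
    value = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:
            return value, pos
        shift += 7


def make_raw_cidv1(content: bytes) -> str:
    """Build a base32 CIDv1 (raw codec, sha2-256) for the given bytes."""
    digest = hashlib.sha256(content).digest()
    # <cid version><content codec><hash function code><digest length><digest>
    cid_bytes = bytes([0x01, 0x55, 0x12, len(digest)]) + digest
    body = base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")
    return "b" + body  # 'b' is the multibase prefix for lowercase base32


def parse_cidv1(cid: str) -> dict:
    """Split a base32 CIDv1 string into the fields named in the grammar."""
    if not cid.startswith("b"):
        raise ValueError("this sketch only handles base32 ('b') CIDs")
    body = cid[1:].upper()
    body += "=" * (-len(body) % 8)  # restore the padding base32 decoding expects
    raw = base64.b32decode(body)
    version, pos = read_varint(raw, 0)
    codec, pos = read_varint(raw, pos)
    hash_code, pos = read_varint(raw, pos)
    hash_len, pos = read_varint(raw, pos)
    digest = raw[pos:]
    if len(digest) != hash_len:
        raise ValueError("digest length does not match the multihash header")
    return {
        "multibase": cid[0],
        "version": version,
        "codec": codec,
        "hash_code": hash_code,
        "digest": digest,
    }


if __name__ == "__main__":
    cid = make_raw_cidv1(b"hello")
    print(cid)
    print(parse_cidv1(cid))
```

Every field is either a fixed code or the digest itself, which matches the point above: there is no spare slot where a tag or vector could live.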

As @SionoiS said, using IPLD is at least the “more obvious” direction to go here IMHO.

Yeah probably the easiest way to look at things (I’m happy to be corrected here).

zacharywhitley · September 20, 2022, 5:41pm

There’s a possibility that you could add a new hash and use a locality-sensitive hash: something like MinHash or SimHash, or pHash if you wanted to do images. Although I’m not sure how useful it would be on the DHT. There was a discussion that I read a while ago.

github.com/ipfs-inactive/faq: Perceptual content identification in IPFS
(denisnazarov; opened 04:48PM - 24 Apr 15 UTC, closed 07:38PM - 23 May 17 UTC)

The following is mostly paraphrased from the paper cited below:

Content identification in P2P networks has until now been achieved by using metadata or cryptographic hashes. However, with an increasing number of duplicates in different names and formats, especially in (unmanaged) P2P networks, these tools have become insufficient for proper content finding. This is especially a problem for digital images, which exist in various formats and compressions as they propagate the web. See https://github.com/mine-code/canonical-content-registry for a thorough identification of the issue.

A possible approach is to identify the content in P2P networks by using perceptual hashes (or fingerprints) extracted from the perceptual features of the content, robust to typical processing. The uniform distribution of the extracted fingerprints enables the usage of existing DHT-based keyword search mechanisms for fingerprint queries. In practice, this would allow querying the DHT for canonical metadata related to an image by using perceptual features of an image instance.

Such a system has lots of implications for persistent metadata for digital media; most importantly, it enables persistent attribution. Currently, attribution for digital media is explicit and is easily lost as content goes viral on the internet. This is a major disservice to content creators because they are unable to be discovered to reap the benefits of their virality. It forces them to be reliant on centralized distribution platforms such as YouTube, Twitter, Tumblr, Instagram, etc. for identity, which monetize all the content flowing through their pipes with no regard for attribution. Persistent metadata can also enable much more effective aggregation and discovery of knowledge related to digital media.

The paper [Content Based Video Identification in Peer-to-Peer Networks: Requirements and a Novel Solution](http://mmc.tudelft.nl/sites/default/files/Content%20Based%20Video%20Identification%20in%20P2P%20Networks%20_Full%20paper_.pdf) by Koz and Lagendijk proposes a solution using a DHT. They identify the following:

- file names and cryptographic hashes are not sufficient to identify the multimedia files in different names and formats in current P2P systems
- DHT systems are the latest state-of-the-art in P2P search mechanisms with the advantages of distributed traffic and storage, scalability, and guaranteed search

Their proposed distributed system includes the following:

- A virtual fingerprint space representing the fingerprint vectors is constructed and this space is partitioned by all the peers in the network
- Fingerprints of a shared image at a peer are automatically extracted by the client program at that peer.
- Extracted fingerprints are mapped to the fingerprint space and indexing information about the fingerprint is stored at the peer containing the mapped position
- Fingerprint queries are collaboratively routed by the peers

I am interested in implementing this specifically for images and wonder how an approach like the above can be implemented as a query layer in IPFS that points to metadata, ability to retrieve image instances, etc. @jbenet @jessewalden @moudy @muneeb-ali @shea256

Related: https://github.com/namesystem/blockstore/issues/79 https://github.com/namesystem/blockstore/issues/81 https://github.com/ipfs/blockchain-data/issues/1

and some publications
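For readers unfamiliar with locality-sensitive hashes, here is a rough Python sketch of a simplified SimHash over whitespace-separated words. It is an illustration only: it uses unweighted single words rather than shingles, and the function names are my own, not from IPFS or any hashing library. The idea is that texts sharing most of their words tend to get fingerprints that differ in only a few bits, unlike a cryptographic hash.

```python
import hashlib


def simhash(text: str, bits: int = 64) -> int:
    """Simplified SimHash: every word votes on every bit of the fingerprint."""
    counts = [0] * bits
    for token in text.lower().split():
        # Take the first bits/8 bytes of SHA-256 as a stable per-word hash.
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, count in enumerate(counts):
        if count > 0:  # the majority of words had this bit set
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    doc_a = "the court granted the motion to dismiss the complaint"
    doc_b = "the court granted the motion to dismiss the appeal"
    doc_c = "quarterly earnings rose on strong cloud revenue"
    print(hamming_distance(simhash(doc_a), simhash(doc_b)))
    print(hamming_distance(simhash(doc_a), simhash(doc_c)))
```

Searching then means finding stored fingerprints within a small Hamming distance of the query fingerprint, which is a different lookup pattern from the exact-key lookups a DHT is built for.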

zacharywhitley · September 20, 2022, 6:04pm

You mmmmmight be able to compute hashes and/or vectors and make that available. You’d download the index, do your query and then retrieve specific documents from IPFS/Filecoin.

zacharywhitley · September 21, 2022, 12:21am

Check out Qdrant. https://qdrant.tech There are others you can choose from but it’s pretty lightweight and not overly complex.

You might have some luck writing snapshots to IPFS (see Qdrant - Snapshots).

Extra bonus points for compiling Qdrant to WebAssembly although it might not perform as well.

endomorphosis · October 19, 2023, 8:54am

This was my proposal, and I have submitted an updated proposal.

endomorphosis · October 19, 2023, 8:57am

Technically, a vector embedding is a locality-based hash. To search, you compute the embedding of what you want to search for, then run a k-nearest-neighbors (or approximate nearest neighbor) comparison against the stored embeddings, using one of several distance measures.
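As a rough illustration of that search step, here is a brute-force k-nearest-neighbors sketch in plain Python using cosine similarity. The index keys are placeholder strings standing in for CIDs, and the three-dimensional vectors are invented; in practice the embeddings would come from whatever model is used, which is outside this sketch.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def k_nearest(query: list[float], index: dict[str, list[float]], k: int) -> list[str]:
    """Return the k index keys whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query, vec), key) for key, vec in index.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [key for _, key in scored[:k]]


if __name__ == "__main__":
    # Placeholder keys, NOT real CIDs; toy vectors, NOT real embeddings.
    index = {
        "cid-contract-law": [0.9, 0.1, 0.0],
        "cid-tax-law": [0.1, 0.9, 0.0],
        "cid-patent-law": [0.0, 0.2, 0.9],
    }
    query = [0.8, 0.2, 0.0]  # would be the embedding of the search text
    print(k_nearest(query, index, k=2))
```

The keys returned are what you would then fetch from IPFS/Filecoin. Brute force is fine for a small index; approximate nearest neighbor structures trade a little accuracy for speed on larger ones.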

endomorphosis · October 19, 2023, 8:58am

This is what I did with the most recent hackathon entry.
