Programmatically set CIDs

Hey! I’m reviewing the grant proposal “Legal Text Repository and KNN based CID search engine” (Issue #923 in filecoin-project/devgrants on GitHub), which proposes to set CIDs based on the content of the files for easy vectorised search and retrieval. At first glance, my understanding is that there is no way to do this, given that a CID is just a prefix plus a hash of the content. However, it seems there may be a way to use IPLD to associate custom metadata with IPFS content (similar to the LikeCoin use case described in “Case study: LikeCoin” in the IPFS Docs). Could @Discordian or @danieln chime in on rough feasibility?

Thanks!

My $0.02, if I may.

CIDs are not meant to contain that information. The best approach would be to use IPLD to build an index on top.

There must be clever ways to design indexes to search for information in a way that fits IPLD/IPFS.
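One way to picture "indexing on top" is a separate metadata node that links to the document by its CID. Below is a minimal Python sketch, assuming DAG-JSON's link notation ({"/": "<cid>"}). The field names (title, tags, document) and the helpers raw_sha256_cid and metadata_node are made up for illustration and are not part of the proposal. The CID helper only covers the simple case of a single raw block hashed with sha2-256.

```python
import base64
import hashlib
import json


def raw_sha256_cid(content: bytes) -> str:
    """Build a CIDv1 string (base32, raw codec, sha2-256) for some bytes."""
    digest = hashlib.sha256(content).digest()
    # 0x01 = CIDv1, 0x55 = raw codec, 0x12 = sha2-256, then the digest length
    raw = bytes([0x01, 0x55, 0x12, len(digest)]) + digest
    return "b" + base64.b32encode(raw).decode("ascii").lower().rstrip("=")


def metadata_node(doc_cid: str, title: str, tags: list) -> dict:
    """Hypothetical metadata node; the link uses DAG-JSON's {"/": cid} form."""
    return {"title": title, "tags": tags, "document": {"/": doc_cid}}


document = b"Example court filing text"
doc_cid = raw_sha256_cid(document)

# Two versions of the metadata, both pointing at the same document.
node_v1 = metadata_node(doc_cid, "Example filing", ["contract"])
node_v2 = metadata_node(doc_cid, "Example filing", ["contract", "appeal"])

# Serialize with sorted keys and no extra whitespace so the bytes are stable.
print(json.dumps(node_v2, sort_keys=True, separators=(",", ":")))
```

The metadata changes between node_v1 and node_v2, but the document's CID stays the same, because it is derived only from the document bytes. Searchable information lives in the linking node, not in the CID.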

How large is the RECAP archive?

It looks like it’s about 10 GB. There seem to be two largely separate problems: distributing the archive and searching the archive. Distributing the archive is a good fit for Filecoin. Searching? Not so much.

As you suggested, there’s a possibility that you could add a new similarity-based hash to IPFS. It’s still questionable how well you could search based on that, and it would be a lot of work. You’d probably be better off automating getting it into a database like Elastic.

Thanks! @SionoiS that’s what I thought, I just wanted to get a second opinion on using IPLD. @zacharywhitley I assume searching the archive would have to happen in IPFS. AFAIK, there isn’t any built-out solution for searching Filecoin CIDs based on metadata (but I may be wrong here).

Funny, I was going to ask you. I really like this idea, but I’m not sure it is feasible. If it is remotely feasible, some way of “tagging” CIDs in an embedded way sounds super awesome…

I don’t know enough about Filecoin’s CID method, or whether it allows enough space to represent the vectors.

I’m not sure what the max length of a CID can be; I believe that’s tied to the longest multihash length. If we look at the spec (multiformats/cid), we see this:

<cidv1> ::= <multibase-prefix><multicodec-cidv1><multicodec-content-type><multihash-content-address>

So I don’t see anywhere to store this info (outside of the hash itself), if we’re talking about storing it directly in a CID. You have the multibase prefix, version, multicodec type, and then the multihash itself. So if we’re baking more info in, I believe we’d need something like a cidv2 (something I just made up).
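To make the layout above concrete, here is a rough Python sketch that builds a CIDv1 for a small piece of content and then splits it back into its parts. It only handles base32 ('b') CIDs, and the function names (read_varint, make_cidv1_raw_sha256, parse_cidv1) are my own, not from any IPFS library. The codes used (0x55 for raw, 0x12 for sha2-256) come from the multicodec table.

```python
import base64
import hashlib


def read_varint(data: bytes, pos: int):
    """Decode one unsigned varint starting at data[pos]; return (value, next_pos)."""
    value, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:
            return value, pos
        shift += 7


def make_cidv1_raw_sha256(content: bytes) -> str:
    """Build a base32 CIDv1 (raw codec, sha2-256) for a single block of bytes."""
    digest = hashlib.sha256(content).digest()
    raw = bytes([0x01, 0x55, 0x12, len(digest)]) + digest
    return "b" + base64.b32encode(raw).decode("ascii").lower().rstrip("=")


def parse_cidv1(cid: str) -> dict:
    """Split a base32 CIDv1 string into the fields named in the spec grammar."""
    if not cid.startswith("b"):
        raise ValueError("this sketch only handles base32 ('b') CIDs")
    body = cid[1:].upper()
    body += "=" * (-len(body) % 8)  # restore the padding base32 decoding expects
    raw = base64.b32decode(body)
    version, pos = read_varint(raw, 0)
    codec, pos = read_varint(raw, pos)
    hash_code, pos = read_varint(raw, pos)
    digest_len, pos = read_varint(raw, pos)
    return {
        "multibase": cid[0],
        "version": version,
        "codec": codec,
        "hash_code": hash_code,
        "digest_len": digest_len,
        "digest": raw[pos:pos + digest_len],
        "leftover": raw[pos + digest_len:],  # bytes not accounted for by the grammar
    }


cid = make_cidv1_raw_sha256(b"hello")
parts = parse_cidv1(cid)
print(cid)
print({k: (v.hex() if isinstance(v, bytes) else v) for k, v in parts.items()})
```

Every byte is accounted for by the prefix, version, codec, and multihash, so the leftover field comes back empty. That is the "nowhere to store this info" point in code form.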

As @SionoiS said, using IPLD is at least the “more obvious” direction to go here IMHO.

Yeah probably the easiest way to look at things (I’m happy to be corrected here).

There’s a possibility that you could add a new hash and use a locality-sensitive hash: something like MinHash or SimHash, or pHash if you wanted to do images. Although I’m not sure how useful it would be on the DHT. There was a discussion that I read a while ago.
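For anyone unfamiliar with locality-sensitive hashing, here is a small SimHash sketch in Python. It is a toy version under my own assumptions (whitespace tokenization, equal token weights, 64-bit fingerprints taken from sha256), not something IPFS supports today. The idea is that texts sharing many tokens tend to get fingerprints that differ in few bits, whereas a cryptographic hash changes completely on any edit.

```python
import hashlib


def simhash(text: str, bits: int = 64) -> int:
    """Toy SimHash: each token votes +1/-1 on every bit of the fingerprint."""
    counts = [0] * bits
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            counts[i] += 1 if (token_hash >> i) & 1 else -1
    fingerprint = 0
    for i, count in enumerate(counts):
        if count > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


doc_a = "the court granted the motion to dismiss the complaint"
doc_b = "the court granted the motion to dismiss the appeal"
doc_c = "quarterly earnings rose on strong cloud revenue"

print("a vs b:", hamming(simhash(doc_a), simhash(doc_b)))
print("a vs c:", hamming(simhash(doc_a), simhash(doc_c)))
```

Searching then means comparing Hamming distances between fingerprints. An exact-match lookup like the DHT's does not do that out of the box, which is the open question raised above.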

and some publications

You mmmmmight be able to compute hashes and/or vectors and make that available. You’d download the index, do your query and then retrieve specific documents from IPFS/Filecoin.

Check out Qdrant. https://qdrant.tech There are others you can choose from but it’s pretty lightweight and not overly complex.

You might have some luck writing snapshots to IPFS; see “Qdrant - Snapshots”.

Extra bonus points for compiling Qdrant to WebAssembly although it might not perform as well.

This was my proposal, and I have submitted an updated proposal.

Technically, a vector embedding is a locality-based hash. To search, you compute the embedding of what you want to search for, then run a k-nearest-neighbors (or approximate nearest neighbor) comparison against the stored embeddings, using one of several distance measures.
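As a rough illustration of that search step, here is a brute-force k-nearest-neighbors lookup in Python using cosine similarity. The three-dimensional vectors and the "cid-A"-style keys are placeholders I made up. In practice the embeddings would come from a model, the keys would be real CIDs, and a large index would likely use an approximate method instead of scanning everything.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_cids(query, index, k):
    """Return the k keys whose embeddings are most similar to the query."""
    ranked = sorted(
        index.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [cid for cid, _ in ranked[:k]]


# Placeholder index: hypothetical CID -> toy embedding of that document.
INDEX = {
    "cid-A": [1.0, 0.0, 0.0],
    "cid-B": [0.0, 1.0, 0.0],
    "cid-C": [0.9, 0.1, 0.0],
}

query_embedding = [0.2, 0.9, 0.0]  # would be computed from the search text
print(nearest_cids(query_embedding, INDEX, k=2))
```

The result is a list of CIDs to fetch from IPFS/Filecoin, so the index itself can be distributed separately from the documents.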

This is what I did with the most recent hackathon entry.