It looks like it’s about 10 GB. There seem to be two largely separate problems: distributing the archive and searching it. Distributing the archive is a good fit for Filecoin. Searching? Not so much.
As you suggested, there’s a possibility that you could add a new similarity-based hash to IPFS. It’s still questionable how well you could search based on that, and it would be a lot of work. You’d probably be better off automating getting it into a database like Elastic.
Thanks! @SionoiS that’s what I thought, I just wanted to get a second opinion on using IPLD. @zacharywhitley I assume searching the archive would have to happen in IPFS. AFAIK, there isn’t any built-out solution for searching Filecoin CIDs based on metadata (but I may be wrong here).
Funny, I was going to ask you. I really like this idea, but I’m not sure it is feasible. If it is remotely feasible, some way of “tagging” CIDs in an embedded way sounds super awesome…
I don’t know enough about Filecoin’s CID method, or whether it allows enough space to represent the vectors.
I’m not sure what the max length of a CID can be, I believe that’s tied to the longest multihash length. If we look at the spec… multiformats/cid we see this:
So I don’t see anywhere to store this info (outside of the hash itself), if we’re talking about storing it directly in a CID. You have the multibase prefix, version, multicodec type, and then the multihash itself. So if we’re baking more info in, I believe we’d need something like a cidv2 (something I just made up).
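To make that layout concrete, here’s a rough Python sketch that builds a CIDv1 for some bytes and then splits it back into its parts. It assumes the base32 multibase (prefix `b`), the `raw` multicodec (0x55) and a sha2-256 multihash (0x12); the helper names are made up for illustration and this is not how any official library does it. The point is just that every byte after the prefix is already spoken for.

```python
import base64
import hashlib


def read_varint(buf, pos):
    """Decode an unsigned varint starting at buf[pos]; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return value, pos
        shift += 7


def make_cidv1_raw_sha256(data):
    """Build a base32 CIDv1 string (hypothetical helper, raw codec + sha2-256)."""
    digest = hashlib.sha256(data).digest()
    # version 1, codec 0x55 (raw), hash 0x12 (sha2-256), digest length, digest
    binary = bytes([0x01, 0x55, 0x12, len(digest)]) + digest
    b32 = base64.b32encode(binary).decode("ascii").lower().rstrip("=")
    return "b" + b32  # 'b' is the multibase prefix for lowercase base32


def parse_cidv1(cid):
    """Split a base32 CIDv1 string into its fields (sketch, base32 only)."""
    if not cid.startswith("b"):
        raise ValueError("this sketch only handles the base32 multibase prefix 'b'")
    body = cid[1:].upper()
    body += "=" * (-len(body) % 8)  # restore the padding base32 decoding expects
    raw = base64.b32decode(body)
    version, pos = read_varint(raw, 0)
    codec, pos = read_varint(raw, pos)
    hash_code, pos = read_varint(raw, pos)
    digest_len, pos = read_varint(raw, pos)
    digest = raw[pos:pos + digest_len]
    return {
        "version": version,
        "codec": codec,
        "hash_code": hash_code,
        "digest": digest,
        "leftover_bytes": len(raw) - (pos + digest_len),
    }
```

Parsing the output of `make_cidv1_raw_sha256` gives back the version, codec, hash function and digest, with zero leftover bytes, so there is no spare field to hang a vector on.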
As @SionoiS said, using IPLD is at least the “more obvious” direction to go here IMHO.
Yeah, that’s probably the easiest way to look at things (I’m happy to be corrected here).
There’s a possibility that you could add a new hash function and make it a locality-sensitive hash: something like MinHash or SimHash, or pHash if you wanted to do images. I’m not sure how useful it would be on the DHT, though. There was a discussion that I read a while ago.
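For anyone who hasn’t seen one, here’s a minimal SimHash sketch in Python. It’s a toy under stated assumptions (whitespace-ish tokenisation, unweighted tokens, MD5 used only as a stable token hash, 64-bit fingerprints), not something IPFS or Filecoin supports today. The idea is that similar texts should tend to produce fingerprints that differ in few bits, which is the opposite of what a cryptographic hash gives you.

```python
import hashlib
import re


def token_hash(token, bits=64):
    """Stable hash of one token, truncated to `bits` bits."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[: bits // 8], "big")


def simhash(text, bits=64):
    """Compute a SimHash fingerprint: each token votes +1/-1 on every bit."""
    counts = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        h = token_hash(token, bits)
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, count in enumerate(counts):
        if count > 0:  # the majority of tokens had this bit set
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

You’d compare two documents with `hamming_distance(simhash(doc_a), simhash(doc_b))` and treat a small distance as "probably similar". Note that a DHT does exact-key lookups, so finding "nearby" fingerprints would still need some extra index on top.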
You mmmmmight be able to compute hashes and/or vectors and make that available. You’d download the index, do your query and then retrieve specific documents from IPFS/Filecoin.
Technically, a vector embedding acts like a locality-based hash. To search, you compute the embedding of whatever you want to search for, then run a k-nearest-neighbors (or approximate nearest neighbor) comparison against the stored embeddings, using a distance measure such as cosine similarity.
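Here’s a rough sketch of that lookup step in plain Python, assuming you’ve already downloaded an index that maps document identifiers to embeddings. The three-dimensional vectors and the `doc-a`/`doc-b`/`doc-c` keys are made-up placeholders (in practice the keys would be CIDs and the vectors would come from whatever embedding model you pick), and this is brute-force exact search rather than an approximate method.

```python
import math

# Hypothetical downloaded index: identifier -> embedding vector (toy values).
INDEX = {
    "doc-a": [1.0, 0.0, 0.0],
    "doc-b": [0.0, 1.0, 0.0],
    "doc-c": [0.9, 0.1, 0.0],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest(query, index, k=2):
    """Return the k identifiers whose embeddings are most similar to `query`."""
    scored = [(cosine_similarity(query, vec), key) for key, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [key for _, key in scored[:k]]
```

With the toy index, `nearest([1.0, 0.0, 0.0], INDEX)` returns `["doc-a", "doc-c"]`; you’d then fetch just those documents from IPFS/Filecoin. For an archive of any real size you’d likely swap the brute-force loop for an approximate nearest neighbor library.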