Hello all, I’m new around here. Maybe this has an obvious answer and my search-fu was not strong enough. If so, I apologize.
I’m tinkering with something that will use kubo to store several copies of mostly similar data (the data is code for the same project, but at different versions).
My understanding is that if I add each of these sequentially, the way they are chunked will not have anything to do with the other files I have already added. But since they are mostly similar, I think it would be more efficient to chunk them based on each other’s contents. Here’s an example of the kind of optimization I mean.
alpha.txt ABCDEFGHIJKLMNOPQRSTUVWXYZ
amino.txt ACDEFGHIKLMNPQRSTVWY
4-char chunks, greedy. 46 chars stored (no chunks shared):
alpha: ABCD EFGH IJKL MNOP QRST UVWX YZ
amino: ACDE FGHI KLMN PQRS TVWY
1-7 char chunks, context-aware. 32 chars stored (shared chunks kept once):
alpha: AB JK O TUVW XYZ
both: CDEFGHI LMN PQRS
amino: A K TVWY
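To make the savings concrete, here is a self-contained Go sketch (no boxo dependency) that chunks both strings greedily at a fixed size, deduplicates identical chunks the way a content-addressed store would, and counts the unique bytes stored. The hand-picked "context-aware" chunk lists are taken from the example above.

```go
package main

import "fmt"

// fixedChunks splits s into chunks of at most n bytes, greedily, left to right.
func fixedChunks(s string, n int) []string {
	var out []string
	for len(s) > 0 {
		k := n
		if len(s) < k {
			k = len(s)
		}
		out = append(out, s[:k])
		s = s[k:]
	}
	return out
}

// uniqueBytes returns the total size of the distinct chunks across all inputs,
// i.e. what a deduplicating, content-addressed store would actually keep.
func uniqueBytes(chunkLists ...[]string) int {
	seen := map[string]bool{}
	total := 0
	for _, chunks := range chunkLists {
		for _, c := range chunks {
			if !seen[c] {
				seen[c] = true
				total += len(c)
			}
		}
	}
	return total
}

func main() {
	alpha := "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
	amino := "ACDEFGHIKLMNPQRSTVWY"

	// Greedy 4-byte chunks: none of the chunks coincide, so nothing dedupes.
	greedy := uniqueBytes(fixedChunks(alpha, 4), fixedChunks(amino, 4))

	// Hand-picked, context-aware chunks from the example:
	// CDEFGHI, LMN, and PQRS are shared and stored only once.
	alphaCtx := []string{"AB", "CDEFGHI", "JK", "LMN", "O", "PQRS", "TUVW", "XYZ"}
	aminoCtx := []string{"A", "CDEFGHI", "K", "LMN", "PQRS", "TVWY"}
	aware := uniqueBytes(alphaCtx, aminoCtx)

	fmt.Println(greedy, aware) // 46 32
}
```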
To put it differently: I’m looking to deviate from the byte-at-a-time chunking interface found in boxo/chunker/parse.go (seems this is used by kubo). I’ll probably end up with something that resembles (or wraps) BLAST.
The fork in my road looks like this:
- do this analysis in my app and create chunks one at a time, so I can explicitly control their composition
- add an ipfs files repack <path> option to kubo which takes all of the (presumably similar) files and repacks them in this way
- contribute it… somewhere else? Maybe there is a collection of custom chunkers that I can add it to?
I don’t know the IPFS ecosystem well. If one of these seems like an obvious choice compared with the others, please point me at it. Thanks!