I think I want to write a custom chunker; how should I proceed?

Hello all, I’m new around here. Maybe this has an obvious answer and my search-fu was not strong enough. If so, I apologize.

I’m tinkering with something that will use kubo to store several copies of mostly similar data (the data is code for the same project, but at different versions).

My understanding is that if I add each of these sequentially, the way each is chunked will not take into account the files I have already added. But since they are mostly similar, I think it would be more efficient to chunk them based on each other’s contents. Here’s an example of the kind of optimization I mean.

alpha.txt  ABCDEFGHIJKLMNOPQRSTUVWXYZ
amino.txt  ACDEFGHIKLMNPQRSTVWY

4-char chunks, greedy. 46 chars

alpha: ABCD EFGH IJKL MNOP QRST UVWX YZ
amino: ACDE FGHI KLMN PQRS TVWY

1- to 7-char chunks, context-aware. 32 chars

alpha: AB JK O TUVW XYZ
both:  CDEFGHI LMN PQRS 
amino: A K TVWY
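Standard content-defined chunking gets part of the way toward this kind of alignment: instead of cutting every N bytes, it cuts wherever a rolling hash over the last few bytes matches a target pattern, so identical runs in two files tend to produce identical chunks even when their offsets differ. Here is a toy Go sketch of the idea; the window size, multiplier, and mask are made-up parameters for illustration, not those of any production chunker:

```go
package main

import "fmt"

const (
	window = 8               // rolling-hash window size (hypothetical parameter)
	prime  = uint64(1000003) // hash multiplier (hypothetical parameter)
)

// chunk splits data at content-defined boundaries: a boundary is declared
// wherever the rolling hash of the last `window` bytes matches `mask`,
// subject to min/max chunk sizes. Because boundaries depend only on local
// content, an insertion near the start of a file disturbs only nearby
// chunks; later chunks realign and can dedup against the original.
func chunk(data []byte, min, max int, mask uint64) [][]byte {
	// precompute prime^window, used to drop the byte leaving the window
	pow := uint64(1)
	for i := 0; i < window; i++ {
		pow *= prime
	}
	var chunks [][]byte
	start := 0
	var h uint64
	for i := 0; i < len(data); i++ {
		h = h*prime + uint64(data[i]) // add the incoming byte
		if i >= window {
			h -= pow * uint64(data[i-window]) // remove the outgoing byte
		}
		size := i - start + 1
		if size >= min && (h&mask == mask || size >= max) {
			chunks = append(chunks, data[start:i+1])
			start = i + 1
		}
	}
	if start < len(data) {
		chunks = append(chunks, data[start:])
	}
	return chunks
}

func main() {
	base := []byte("the quick brown fox jumps over the lazy dog, again and again and again")
	edit := append([]byte("PREFIX! "), base...) // same content, shifted by an insertion

	seen := map[string]bool{}
	for _, c := range chunk(base, 4, 32, 0xF) {
		seen[string(c)] = true
	}
	shared := 0
	for _, c := range chunk(edit, 4, 32, 0xF) {
		if seen[string(c)] {
			shared++
		}
	}
	fmt.Println("chunks shared after an insertion:", shared)
}
```

Note that this only exploits *exact* shared runs; the alpha/amino example above, which pairs up similar-but-unequal sequences, is closer to alignment (BLAST territory) than to what a rolling-hash chunker can do.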

To put it differently: I’m looking to deviate from the byte-at-a-time chunking interface found in boxo/chunker/parse.go (which kubo appears to use). I’ll probably end up with something that resembles (or wraps) BLAST.

The fork in my road looks like this:

  1. do this analysis in my app and create chunks one at a time, so I can explicitly control their composition
  2. add an ipfs files repack <path> option to kubo which takes in all of the (presumably similar) files and repacks them in this way.
  3. contribute it… somewhere else? Maybe there is a collection of custom chunkers that I can add it to?

I don’t know the IPFS ecosystem well. If one of these seems like an obvious choice compared with the others, please point me at it. Thanks!

I think there are two ways you can explore:

  1. Easy: play with the generic chunkers built into Kubo (selected via ipfs add --chunker).
    • namely, rabin-[min]-[avg]-[max] and buzhash may produce better deduplication than the default fixed-size chunker. YMMV, but perhaps it is good enough for your dataset for now, and you can always switch to external/advanced chunking later.
  2. Advanced: writing your own chunker.
    • Good news is that you don’t need to fork Kubo for this.
    • If you’re looking for high-level guidance:
      • I’d write your own tool that reads a directory from your local filesystem and outputs a CAR stream containing a valid UnixFS DAG (dag-pb & raw blocks),
      • then pipe it into Kubo via ipfs dag import.
      • If you do this in Go you can also reuse boxo and existing libraries to a degree; we also have some WIP UnixFS specs here if you want to fine-tune chunking your own way.
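If you go the CAR route, each chunk becomes a block addressed by its CID. In real code you would use github.com/ipfs/go-cid and boxo’s UnixFS/CAR helpers rather than hand-assembling bytes, but as a stdlib-only sketch of what a block’s identity is: a CIDv1 for a raw block is four fixed prefix bytes (version 1, raw codec, sha2-256 multihash code, 32-byte digest length) followed by the digest, multibase-encoded in lowercase base32:

```go
package main

import (
	"crypto/sha256"
	"encoding/base32"
	"fmt"
)

// lowercase RFC 4648 base32 without padding, as used by multibase prefix "b"
var b32 = base32.NewEncoding("abcdefghijklmnopqrstuvwxyz234567").
	WithPadding(base32.NoPadding)

// rawCID computes a CIDv1 string for a raw block:
// 0x01 (CID version 1), 0x55 (raw codec),
// 0x12 0x20 (sha2-256 multihash, 32-byte digest), then the digest itself.
func rawCID(block []byte) string {
	d := sha256.Sum256(block)
	buf := append([]byte{0x01, 0x55, 0x12, 0x20}, d[:]...)
	return "b" + b32.EncodeToString(buf)
}

func main() {
	fmt.Println(rawCID([]byte("hello world")))
}
```

This only covers the raw leaves; the dag-pb nodes that link them into a UnixFS file use a different codec byte, which is the layer boxo’s UnixFS and CAR packages take care of for you.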

Tremendously helpful, thank you!