I think I want to write a custom chunker; how should I proceed?

Hello all, I’m new around here. Maybe this has an obvious answer and my search-fu was not strong enough. If so, I apologize.

I’m tinkering with something that will use kubo to store several copies of mostly similar data (the data is code for the same project, but at different versions).

My understanding is that if I add each of these sequentially, the way each is chunked will not take into account the files I have already added. But since they are mostly similar, I think it would be more efficient to chunk them based on each other’s contents. Here’s an example of the kind of optimization I mean.

alpha.txt  ABCDEFGHIJKLMNOPQRSTUVWXYZ
amino.txt  ACDEFGHIKLMNPQRSTVWY

4-char chunks, greedy. 46 chars

alpha: ABCD EFGH IJKL MNOP QRST UVWX YZ
amino: ACDE FGHI KLMN PQRS TVWY

1- to 7-char chunks, context-aware. 32 chars

alpha: AB JK O TUVW XYZ
both:  CDEFGHI LMN PQRS 
amino: A K TVWY
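Standard content-defined chunking gets part of the way toward this kind of alignment: instead of cutting every N bytes, it cuts wherever a rolling hash over the last few bytes matches a target pattern, so identical runs in two files tend to produce identical chunks even when their offsets differ. Here is a toy Go sketch of the idea; the window size, multiplier, and mask are made-up parameters for illustration, not those of any production chunker:

```go
package main

import "fmt"

const (
	window = 8               // rolling-hash window size (hypothetical parameter)
	prime  = uint64(1000003) // hash multiplier (hypothetical parameter)
)

// chunk splits data at content-defined boundaries: a boundary is declared
// wherever the rolling hash of the last `window` bytes matches `mask`,
// subject to min/max chunk sizes. Because boundaries depend only on local
// content, an insertion near the start of a file disturbs only nearby
// chunks; later chunks realign and can dedup against the original.
func chunk(data []byte, min, max int, mask uint64) [][]byte {
	// precompute prime^window, used to drop the byte leaving the window
	pow := uint64(1)
	for i := 0; i < window; i++ {
		pow *= prime
	}
	var chunks [][]byte
	start := 0
	var h uint64
	for i := 0; i < len(data); i++ {
		h = h*prime + uint64(data[i]) // add the incoming byte
		if i >= window {
			h -= pow * uint64(data[i-window]) // remove the outgoing byte
		}
		size := i - start + 1
		if size >= min && (h&mask == mask || size >= max) {
			chunks = append(chunks, data[start:i+1])
			start = i + 1
		}
	}
	if start < len(data) {
		chunks = append(chunks, data[start:])
	}
	return chunks
}

func main() {
	base := []byte("the quick brown fox jumps over the lazy dog, again and again and again")
	edit := append([]byte("PREFIX! "), base...) // same content, shifted by an insertion

	seen := map[string]bool{}
	for _, c := range chunk(base, 4, 32, 0xF) {
		seen[string(c)] = true
	}
	shared := 0
	for _, c := range chunk(edit, 4, 32, 0xF) {
		if seen[string(c)] {
			shared++
		}
	}
	fmt.Println("chunks shared after an insertion:", shared)
}
```

Note that this only exploits *exact* shared runs; the alpha/amino example above, which pairs up similar-but-unequal sequences, is closer to alignment (BLAST territory) than to what a rolling-hash chunker can do.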

To put it differently: I’m looking to deviate from the byte-at-a-time chunking interface found in boxo/chunker/parse.go (which kubo appears to use). I’ll probably end up with something that resembles (or wraps) BLAST.

The fork in my road looks like this:

  1. do this analysis in my app and create chunks one at a time, so I can explicitly control their composition
  2. add an ipfs files repack <path> option to kubo which takes in all of the (presumably similar) files and repacks them in this way.
  3. contribute it… somewhere else? Maybe there is a collection of custom chunkers that I can add it to?

I don’t know the IPFS ecosystem well. If one of these seems like an obvious choice compared with the others, please point me at it. Thanks!

I think there are two ways you can explore:

  1. Easy: play with the generic chunkers built into Kubo (selected via ipfs add --chunker).
    • namely, rabin-[min]-[avg]-[max] and buzhash may produce better deduplication than the default fixed-size chunker. YMMV, but perhaps it is good enough for your dataset for now, and you can always switch to external/advanced chunking later.
  2. Advanced: writing your own chunker.
    • Good news is that you don’t need to fork Kubo for this.
    • If you’re looking for high-level guidance:
      • I’d write your own tool that reads a directory from your local filesystem and outputs a CAR stream containing a valid UnixFS DAG (dag-pb & raw blocks),
      • then pipe it into Kubo via ipfs dag import.
      • If you do this in Go you can also reuse boxo and existing libraries to a degree; we also have some WIP UnixFS specs here if you want to fine-tune chunking your own way.
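If you go the CAR route, each chunk becomes a block addressed by its CID. In real code you would use github.com/ipfs/go-cid and boxo’s UnixFS/CAR helpers rather than hand-assembling bytes, but as a stdlib-only sketch of what a block’s identity is: a CIDv1 for a raw block is four fixed prefix bytes (version 1, raw codec, sha2-256 multihash code, 32-byte digest length) followed by the digest, multibase-encoded in lowercase base32:

```go
package main

import (
	"crypto/sha256"
	"encoding/base32"
	"fmt"
)

// lowercase RFC 4648 base32 without padding, as used by multibase prefix "b"
var b32 = base32.NewEncoding("abcdefghijklmnopqrstuvwxyz234567").
	WithPadding(base32.NoPadding)

// rawCID computes a CIDv1 string for a raw block:
// 0x01 (CID version 1), 0x55 (raw codec),
// 0x12 0x20 (sha2-256 multihash, 32-byte digest), then the digest itself.
func rawCID(block []byte) string {
	d := sha256.Sum256(block)
	buf := append([]byte{0x01, 0x55, 0x12, 0x20}, d[:]...)
	return "b" + b32.EncodeToString(buf)
}

func main() {
	fmt.Println(rawCID([]byte("hello world")))
}
```

This only covers the raw leaves; the dag-pb nodes that link them into a UnixFS file use a different codec byte, which is the layer boxo’s UnixFS and CAR packages take care of for you.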

Tremendously helpful, thank you!