How can I efficiently add a directory from the filesystem using Helia? I have some working code, but it is way too slow: Kubo can add a directory (node_modules - many files but not a huge total size) in seconds, while in JS this takes far longer.
Is this a limitation of JS as a language or is there something hugely inefficient here?
This seems to be similar to the approach ipfs-car takes. The only problem I can see with this approach is that it loads from the FS, then hashes, then saves to the blockstore before moving on to the next item. If this is really what is making it so slow, are there any other libraries for Helia that achieve this in a more performant way?
@Saul do you have the full code example you can share? It looks like there are some methods being called that aren't defined in the snippet you provided. Some things missing:

- the blockstore you're using
- how you're instantiating Helia & unixfs
Also, you're using fs.createReadStream(path) when adding content, and that read stream produces chunks of 64 KiB, which could significantly impact the speed when compared to kubo. From the Node.js docs:

> Unlike the 16 KiB default highWaterMark for a <stream.Readable>, the stream returned by this method has a default highWaterMark of 64 KiB.
I am using the default blockstore with Helia, the in-memory one, which shouldn't be a bottleneck other than being space limited.

I am instantiating Helia and UnixFS as basically as possible for this scenario, as you can see in the code below.

Thanks for the suggestion of changing the highWaterMark to 16 KiB, but unfortunately it has made little difference.

Here is the full source code. Obviously the node_modules will differ on your machine, but it should have enough files to show the issue. (Please let me know if there are any standard go-to data sets containing many small files to benchmark against.)
```typescript
import type { PBLink } from '@ipld/dag-pb'
import { UnixFS } from 'ipfs-unixfs'
import { unixfs as createUnixfs } from '@helia/unixfs'
import * as dagPB from '@ipld/dag-pb'
import { sha256 } from 'multiformats/hashes/sha2'
import { CID } from 'multiformats/cid'
import Path from 'path'
import fs from 'fs'
import { createHelia } from 'helia'
import { exec } from 'child_process'

const directory = './node_modules'

// Note: `find` lists directories as well as files, so this is an upper bound
// used only for the progress percentage.
const totalFileCount = await new Promise<number>(
  resolve => exec(`find ${directory} | wc -l`, (_, result) => resolve(+result))
)

const helia = await createHelia()
const unixfs = createUnixfs(helia)

let files = 0

const addDir = async function (dir: string): Promise<PBLink> {
  const dirents = await fs.promises.readdir(dir, { withFileTypes: true })
  const links: PBLink[] = []

  for (const dirent of dirents) {
    const path = Path.join(dir, dirent.name)

    if (dirent.isDirectory()) {
      const link = await addDir(path)

      links.push({
        Hash: link.Hash,
        Name: dirent.name
      })
    } else {
      const cid = await unixfs.addFile({
        content: fs.createReadStream(path, { highWaterMark: 16 * 1024 })
      })

      files++
      console.log(`${files}, ${(files * 100 / totalFileCount).toFixed(2)}%`)

      links.push({ Name: dirent.name, Hash: cid })
    }
  }

  const metadata = new UnixFS({
    type: 'directory'
  })

  const buf = dagPB.encode({
    Data: metadata.marshal(),
    Links: links
  })

  const hash = await sha256.digest(buf)
  const cid = CID.create(1, dagPB.code, hash)

  return { Hash: cid }
}

console.log(await addDir(directory))
```
Here is another version based off the code sample you mentioned:
Enabling the debugger, I can see that helia:unixfs:cp jumps from taking <10ms to >1s (in the second code example) after running for a while. Could this be something wrong internally in Helia or UnixFS?

Any idea what could be the cause of that? I feel that the cp part should be fairly quick and consistent, since it is only adding another link to the DAG.
@Saul I doubt it's that; to get a precise reproduction of the issue, a repo would be nice to play with. With all the things @SgtPooki mentioned, I think you can restructure the code so that you don't wait for one directory to be added at a time. I'd recommend leveraging JS promises to make these requests async more effectively.
For example (I didn't set up types, just running simple JS):
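In case it helps, here is a minimal sketch of that restructuring. The names are mine (`AddFile`, `addDirConcurrent`, the injected `addFile` callback), and the dag-pb encoding of directory nodes is left out; in the real code `addFile` would be something like `(path) => unixfs.addFile({ content: fs.createReadStream(path) })`:

```typescript
import fs from "fs";
import Path from "path";

// Stand-in for the real file-adder returning a CID-like handle.
type AddFile = (path: string) => Promise<unknown>;

interface Link {
  Name: string;
  Hash: unknown;
}

// Kick off every entry in a directory at once and wait for the whole
// batch with Promise.all, instead of awaiting each entry sequentially.
// Note: concurrency is unbounded, so a very wide tree can exhaust
// file descriptors - see the caveat below about crashing the process.
async function addDirConcurrent (dir: string, addFile: AddFile): Promise<Link[]> {
  const dirents = await fs.promises.readdir(dir, { withFileTypes: true });

  return Promise.all(dirents.map(async (dirent): Promise<Link> => {
    const path = Path.join(dir, dirent.name);

    if (dirent.isDirectory()) {
      // Recurse concurrently too. Encoding the directory node and
      // computing its CID is omitted; the child links stand in for it.
      return { Name: dirent.name, Hash: await addDirConcurrent(path, addFile) };
    }

    return { Name: dirent.name, Hash: await addFile(path) };
  }));
}
```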
I did multiple runs and all of them completed in roughly 165 seconds. YMMV, but I didn't experience any slowness as you described. Since the directory you're adding is node_modules, it's really crucial to see what the size is for you.
Also, this goes without saying: this may not work in every case, as it may crash the process. You can still do an efficient traversal of the nodes by performing a breadth-first traversal and then batching those requests per level, or a similar strategy.
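The level-batching idea might look roughly like this (a sketch with hypothetical names; it only counts files and doesn't build the DAG, and `addFile` again stands in for the real adder):

```typescript
import fs from "fs";
import Path from "path";

// Stand-in for the real file-adder (e.g. unixfs.addFile over a read stream).
type AddFile = (path: string) => Promise<unknown>;

// Walk the tree breadth-first and add each level's files as one batch,
// so the number of in-flight adds is bounded by the width of a level
// rather than by the total number of files in the tree.
async function addByLevel (root: string, addFile: AddFile): Promise<number> {
  let level = [root];
  let added = 0;

  while (level.length > 0) {
    const nextLevel: string[] = [];
    const files: string[] = [];

    for (const dir of level) {
      for (const dirent of await fs.promises.readdir(dir, { withFileTypes: true })) {
        const path = Path.join(dir, dirent.name);

        if (dirent.isDirectory()) {
          nextLevel.push(path);
        } else {
          files.push(path);
        }
      }
    }

    // One batch of concurrent adds per level.
    await Promise.all(files.map(async (path) => {
      await addFile(path);
      added++;
    }));

    level = nextLevel;
  }

  return added;
}
```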
The current problem is that Helia is generating a different CID than js-ipfs and kubo, but that's because some changes to defaults were made… please feel free to suggest any improvements / updates.
The benchmarks are running at a similar speed, as should be expected. So it seems Kubo is a lot slower than I thought it was. After scratching my head for a bit, I think I now know what was going on: I must have had at least some of the blocks already in the blockstore and not cleaned it properly. I may have been thinking at the time that the bottleneck would be reading/hashing efficiency; I really should have thought this part through more thoroughly.
This leads me to wonder why hashing a folder with kubo (ipfs add -rn) is very fast, but adding with Helia to an in-memory blockstore is comparatively very slow. If the only difference between hashing-only and a normal add is writing the blocks to disk, I would expect the in-memory add to reach somewhat similar performance, since it eliminates the filesystem write bottleneck. Obviously I am missing something here, since this is not the case; it would be appreciated if someone could point out what it is.
As far as an in-memory blockstore being faster than a disk blockstore goes… I updated the benchmark I created to support a helia-mem blockstore and datastore vs a helia-fs blockstore and datastore, and the in-memory one is significantly faster.
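For anyone wanting to reproduce the comparison, swapping the stores looks roughly like this (a sketch, not the benchmark code itself; the `./blocks` and `./data` paths are placeholders I picked):

```typescript
import { createHelia } from "helia";
import { unixfs } from "@helia/unixfs";
import { MemoryBlockstore } from "blockstore-core";
import { MemoryDatastore } from "datastore-core";
import { FsBlockstore } from "blockstore-fs";
import { FsDatastore } from "datastore-fs";

// In-memory stores: fast, but bounded by available RAM.
const heliaMem = await createHelia({
  blockstore: new MemoryBlockstore(),
  datastore: new MemoryDatastore()
});

// Filesystem-backed stores: persistent, but every block write hits the disk.
const heliaFs = await createHelia({
  blockstore: new FsBlockstore("./blocks"),
  datastore: new FsDatastore("./data")
});

// The same add code can then be timed against both instances.
const fsMem = unixfs(heliaMem);
const fsDisk = unixfs(heliaFs);
```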