Helia add directory

How can I efficiently add a directory from the filesystem using Helia? I have some working code, but it is far too slow: Kubo can add a directory (node_modules, so many files but not a huge total size) in seconds, while the JS version takes much longer.

Is this a limitation of JS as a language or is there something hugely inefficient here?

const addDir = async function (dir: string): Promise<PBLink> {
	const dirents = await fs.promises.readdir(dir, { withFileTypes: true });
	const links: PBLink[] = [];

	for (const dirent of dirents) {
		const path = Path.join(dir, dirent.name);

		if (dirent.isDirectory()) {
			const link = await addDir(path);

			links.push({
				Hash: link.Hash,
				Name: dirent.name
			});
		} else {
			const cid = await unixfs.addFile({ content: fs.createReadStream(path) });

			links.push({ Name: dirent.name, Hash: cid });
		}
	}

	const metadata = new UnixFS({
		type: 'directory'
	});

	const buf = dagPB.encode({
		Data: metadata.marshal(),
		Links: links
	});

	const hash = await sha256.digest(buf);
	const cid = CID.create(1, dagPB.code, hash);

	return { Hash: cid };
};

This seems similar to the approach ipfs-car takes. The only problem I can see is that it loads each item from the filesystem, hashes it, and saves it to the blockstore before moving on to the next item. If that is really what makes it so slow, are there any other libraries for Helia that do this in a more performant way?

@Saul do you have the full code example you can share? It looks like there are some methods being called that aren’t defined in the snippet you provided. Some things missing:

  • Blockstore you’re using
  • how you’re instantiating helia & unixfs

Also, you’re using fs.createReadStream(path) when adding content, and that read stream emits chunks of up to 64 KiB by default, which could significantly affect speed compared to Kubo. From the Node.js documentation:

Unlike the 16 KiB default highWaterMark for a <stream.Readable>, the stream returned by this method has a default highWaterMark of 64 KiB.
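
To make the highWaterMark point concrete, here is a minimal sketch that uses only Node's fs module (no Helia) to report the chunk sizes a read stream emits. `chunkSizes` and `demo` are hypothetical helpers written for this illustration, and the exact chunk counts may vary by platform.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical helper: collect the size of every chunk a read stream emits.
// When highWaterMark is omitted, Node's default for fs streams applies.
async function chunkSizes(file: string, highWaterMark?: number): Promise<number[]> {
  const sizes: number[] = [];
  const options = highWaterMark === undefined ? {} : { highWaterMark };
  for await (const chunk of fs.createReadStream(file, options)) {
    sizes.push((chunk as Buffer).length);
  }
  return sizes;
}

// Writes a 200 KiB scratch file and compares the two settings.
async function demo(): Promise<void> {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), "hwm-"));
  const file = path.join(dir, "sample.bin");
  fs.writeFileSync(file, Buffer.alloc(200 * 1024));

  const small = await chunkSizes(file, 16 * 1024);
  const byDefault = await chunkSizes(file);
  console.log(`16 KiB: ${small.length} chunks, default: ${byDefault.length} chunks`);

  fs.rmSync(dir, { recursive: true, force: true });
}
```

Calling `demo()` shows how many more chunks the smaller setting produces. Whether that matters for import speed depends on how the importer re-chunks its input, which this sketch does not measure.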

One method of creating a UnixFS directory with helia can be found at https://github.com/ipfs-examples/helia-examples/blob/f86bfb477b255a6448eab1d77e57ac827051f168/examples/helia-create-car/src/components/CarCreator.jsx#L81-L87 which may be a little more performant, but I haven’t done tests on it, and it’s not doing everything your code sample is doing.

I would love to see a new benchmark added to https://github.com/ipfs/helia/tree/main/benchmarks covering this use case, so if you could share your full code that would be great.

I am using the default blockstore with Helia, the in-memory one, which shouldn’t be a bottleneck other than being space limited.

I am instantiating Helia and UnixFS as simply as possible for this scenario, as you can see in the code below.

Thanks for the suggestion of changing the highWaterMark to 16 KiB but unfortunately it has made little difference.

Here is the full source code. Obviously node_modules will differ on your machine, but it should have enough files to show the issue. (Please let me know if there are any standard go-to datasets containing many small files to benchmark against.)

import type { PBLink } from '@ipld/dag-pb'
import { UnixFS } from 'ipfs-unixfs'
import { unixfs as createUnixfs } from "@helia/unixfs";
import * as dagPB from '@ipld/dag-pb'
import { sha256 } from 'multiformats/hashes/sha2'
import { CID } from 'multiformats/cid'
import Path from "path";
import fs from "fs";
import { createHelia } from "helia";
import { exec } from "child_process";

const directory = "./node_modules";

const totalFileCount = await new Promise<number>(
	resolve => exec(`find ${directory} | wc -l`, (_, result) => resolve(+result))
);

const helia = await createHelia();
const unixfs = createUnixfs(helia);

let files = 0;

const addDir = async function (dir: string): Promise<PBLink> {
	const dirents = await fs.promises.readdir(dir, { withFileTypes: true });
	const links: PBLink[] = [];

	for (const dirent of dirents) {
		const path = Path.join(dir, dirent.name);

		if (dirent.isDirectory()) {
			const link = await addDir(path);

			links.push({
				Hash: link.Hash,
				Name: dirent.name
			});
		} else {
			const cid = await unixfs.addFile({ content: fs.createReadStream(path, { highWaterMark: 16 * 1024 }) });
			files++;
			console.log(`${files}, ${(files * 100 / totalFileCount).toFixed(2)}%`);
			links.push({ Name: dirent.name, Hash: cid });
		}
	}

	const metadata = new UnixFS({
		type: 'directory'
	});

	const buf = dagPB.encode({
		Data: metadata.marshal(),
		Links: links
	});

	const hash = await sha256.digest(buf);
	const cid = CID.create(1, dagPB.code, hash);

	return { Hash: cid };
};

console.log(await addDir(directory));
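
A side note on the progress output in the listing above: `find ${directory} | wc -l` counts directories (and the root itself) as well as files, while `files` is only incremented for non-directory entries, so the percentage will stop short of 100%. A minimal sketch of a matching count using Node's fs, where `countFiles` is a hypothetical helper and not part of Helia:

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Count entries that are not directories, mirroring the `else` branch of
// addDir, which treats every non-directory dirent as a file.
function countFiles(dir: string): number {
  let count = 0;
  for (const dirent of fs.readdirSync(dir, { withFileTypes: true })) {
    count += dirent.isDirectory() ? countFiles(path.join(dir, dirent.name)) : 1;
  }
  return count;
}
```

Using `countFiles(directory)` for `totalFileCount` would keep the numerator and denominator counting the same thing.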

Here is another version based on the code sample you mentioned:

import { unixfs as createUnixfs } from "@helia/unixfs";
import { CID } from 'multiformats/cid'
import Path from "path";
import fs from "fs";
import { createHelia } from "helia";
import { exec } from "child_process";

const directory = "./node_modules";

const totalFileCount = await new Promise<number>(
	resolve => exec(`find ${directory} | wc -l`, (_, result) => resolve(+result))
);

const helia = await createHelia();
const unixfs = createUnixfs(helia);

let files = 0;

const addDir = async function (dir: string): Promise<CID> {
	const dirents = await fs.promises.readdir(dir, { withFileTypes: true });

	let rootCid = await unixfs.addDirectory();

	for (const dirent of dirents) {
		const path = Path.join(dir, dirent.name);

		const cid = dirent.isDirectory() ?
			await addDir(path) :
			await unixfs.addFile({ content: fs.createReadStream(path, { highWaterMark: 16 * 1024 }) });

		rootCid = await unixfs.cp(cid, rootCid, dirent.name);

		files++;
		console.log(`${files}, ${(files * 100 / totalFileCount).toFixed(2)}%`);
	}

	return rootCid;
};

console.log(await addDir(directory));

Enabling the debugger, I can see that helia:unixfs:cp jumps from taking <10ms to >1s (in the second code example) after running for a while. Could something be wrong internally in Helia or UnixFS?

Any idea what could be the cause of that? I feel that the cp part should be fairly quick and consistent since it is only adding another link to the dag.
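
One possible contributor, stated as an assumption rather than a diagnosis: if each cp call has to re-encode and re-hash the whole parent directory node, then adding n entries one at a time touches 1 + 2 + ... + n links in total, so the cost per call grows with the directory. The toy model below is not Helia's implementation; `LINK_BYTES` and both functions are hypothetical and only illustrate the arithmetic.

```typescript
import { createHash } from "node:crypto";

// Toy model, NOT Helia's code: a "directory block" is a run of fixed-size
// link records that is hashed in full whenever the link list changes.
const LINK_BYTES = 64; // assumed size of one encoded link (name + CID)

// Hash a directory holding `linkCount` links; returns the bytes hashed.
function hashDirectory(linkCount: number): number {
  const block = Buffer.alloc(linkCount * LINK_BYTES);
  createHash("sha256").update(block).digest();
  return block.length;
}

// Add links one at a time, producing a new directory block per link.
function bytesHashedIncrementally(entries: number): number {
  let total = 0;
  for (let links = 1; links <= entries; links++) {
    total += hashDirectory(links);
  }
  return total;
}

// Collect every link first, then encode and hash the directory once.
function bytesHashedOnce(entries: number): number {
  return hashDirectory(entries);
}
```

In this model a 10,000-entry directory hashes about 3.2 GB incrementally versus 640 KB when encoded once. That is also the structural difference between the first listing, which collects links and encodes each directory once, and the cp-based listing. Sharded (HAMT) directories would behave differently.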

@Saul I doubt it’s that. To get a precise reproduction of the issue, a repo would be nice to play with. Given everything @SgtPooki mentioned, I think you can restructure the code so that you don’t wait for one directory entry to be added at a time; I’d recommend leveraging JS promises to run these requests concurrently.

For example (I didn’t set up types, just running plain JS):

my package.json:

{
  "dependencies": {
    "@helia/unixfs": "^1.3.0",
    "helia": "^1.3.4",
    "multiformats": "^12.0.1"
  },
  "type": "module"
}

example.js

import { unixfs as createUnixfs } from "@helia/unixfs";
import Path from "path";
import fs from "fs";
import { createHelia } from "helia";
import { exec } from "child_process";

const directory = "./node_modules";
const helia = await createHelia();
const unixfs = createUnixfs(helia);
let files = 0;
let totalFileCount = 0;
const startTime = Date.now();

const addDir = async function (dir) {
	const dirents = await fs.promises.readdir(dir, { withFileTypes: true });
	let rootCid = await unixfs.addDirectory();

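	// Add all children concurrently, but link them one at a time: concurrent
	// cp calls would each start from the same stale rootCid and drop entries.
	const entries = await Promise.all(dirents.map(async dirent => {
		const path = Path.join(dir, dirent.name);

		const cid = dirent.isDirectory() ?
			await addDir(path) :
			await unixfs.addFile({ content: fs.createReadStream(path, { highWaterMark: 16 * 1024 }) });

		return { name: dirent.name, cid };
	}));

	for (const { name, cid } of entries) {
		rootCid = await unixfs.cp(cid, rootCid, name);

		files++;
		console.log(`${((Date.now() - startTime) / 1000).toFixed(1)}s: ${files} Files, ${(files * 100 / totalFileCount).toFixed(2)}%`);
	}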

	return rootCid;
};

const run = async () => {
	totalFileCount = await new Promise(
		resolve => exec(`find ${directory} | wc -l`, (_, result) => resolve(+result))
	);
	console.log(await addDir(directory));
};
run();

I did multiple runs and all of them completed in roughly 165 seconds. YMMV, but I didn’t experience the slowdown you described. Since the directory you’re adding is node_modules, it’s really crucial to know how large it is for you.

Also, it goes without saying that this may not work in every case, as launching every request at once may crash the process. You can still traverse the nodes efficiently by performing a breadth-first traversal and batching the requests per level, or a similar strategy.
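
As one example of a "similar strategy", here is a minimal sketch of a concurrency cap. `mapWithLimit` is a hypothetical helper written for this illustration, not something provided by Helia or Node:

```typescript
// Hypothetical helper: run `worker` over `items` with at most `limit`
// calls in flight, returning results in input order.
async function mapWithLimit<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0;

  // Each runner claims the next unprocessed index until none are left.
  const runner = async (): Promise<void> => {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index] as T, index);
    }
  };

  // Start at least one runner so an empty input still resolves cleanly.
  const runnerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}
```

In addDir, the `Promise.all(dirents.map(...))` call could become `mapWithLimit(dirents, 16, ...)`, where 16 is an arbitrary starting point. Because addDir recurses, the cap applies per directory, so the total number of operations in flight can still exceed it; a single shared queue would be needed for a global limit.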

FYI, I threw a benchmark together at test: create add-dir benchmark by SgtPooki · Pull Request #167 · ipfs/helia · GitHub that you can play with if you like, but all tests seem to show helia being significantly faster than js-ipfs.

Current problem is that Helia is generating a different CID than js-ipfs and kubo, but that’s because some changes to defaults were made… please feel free to suggest any improvements / updates.
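
The post does not say which defaults changed. Purely as an illustration, and assuming the difference involves the CID version or the codec recorded for a block, the sketch below shows that the same bytes produce different CID strings once the codec changes. `base32` and `cidV1` are hypothetical, hand-rolled stand-ins for what the multiformats library does, built on Node's crypto module only.

```typescript
import { createHash } from "node:crypto";

const ALPHABET = "abcdefghijklmnopqrstuvwxyz234567";

// RFC 4648 base32, lowercase, without padding.
function base32(bytes: Uint8Array): string {
  let out = "";
  let buffer = 0;
  let bits = 0;
  for (let i = 0; i < bytes.length; i++) {
    buffer = (buffer << 8) | (bytes[i] as number);
    bits += 8;
    while (bits >= 5) {
      bits -= 5;
      out += ALPHABET.charAt((buffer >> bits) & 31);
    }
    buffer &= (1 << bits) - 1; // keep only the bits not yet emitted
  }
  if (bits > 0) out += ALPHABET.charAt((buffer << (5 - bits)) & 31);
  return out;
}

// CIDv1 string = multibase prefix "b" + base32(version, codec, multihash).
// Only valid for codec codes below 0x80, which fit in a one-byte varint.
function cidV1(codec: number, block: Uint8Array): string {
  const digest = createHash("sha256").update(block).digest();
  // 0x01 = CID version 1, 0x12 = sha2-256, 0x20 = digest length in bytes
  const header = Buffer.from([0x01, codec, 0x12, 0x20]);
  return "b" + base32(Buffer.concat([header, digest]));
}

const RAW = 0x55;    // multicodec code for raw bytes
const DAG_PB = 0x70; // multicodec code for dag-pb

// "hello" is not a valid dag-pb block; it only demonstrates the encoding.
const sample = Buffer.from("hello");
console.log(cidV1(RAW, sample));
console.log(cidV1(DAG_PB, sample));
```

The two printed strings share a digest but differ in prefix and body, so two importers that disagree on such defaults will report different root CIDs for identical input.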

helia/benchmarks/add-dir/src at 0e336e2b00b5858be360647822d953fbe2417329 · ipfs/helia · GitHub shows the output of running against a somewhat smaller collection of files in a node_modules directory.

Thanks a lot for this!

The benchmarks run at a similar speed, as should be expected. So it seems Kubo is a lot slower than I thought it was. After scratching my head for a bit, I think I now know what was going on: I must have had at least some of the blocks already in the blockstore and not cleaned it properly. I may have been thinking at the time that the bottleneck would be reading/hashing efficiently. I really should have thought this part through more thoroughly.

This leads me to wonder why hashing a folder with Kubo (ipfs add -rn) is very fast, while adding with Helia to an in-memory blockstore is very slow by comparison. If the only difference between hash-only and a normal add is writing the blocks to disk, I would expect similar performance when writing to memory, since that eliminates the filesystem write bottleneck. Obviously I am missing something, since this is not the case; it would be appreciated if someone could point out what it is.

We have our Helia WG meeting tomorrow and plan to discuss this then. We should have some more ideas for you after that.

Note that one action item from our WG meeting is Efficient directory import · Issue #168 · ipfs/helia · GitHub, which should significantly speed things up.

As for an in-memory blockstore being faster than a disk blockstore: I updated the benchmark I created to compare a helia-mem blockstore and datastore against a helia-fs blockstore and datastore, and the in-memory one is significantly faster.

# >  npm start
┌─────────┬─────────────────────────┬──────────┬─────────┬──────┬──────────┬───────────────────────────────────────────────────────────────┐
│ (index) │     Implementation      │  ops/s   │  ms/op  │ runs │   p99    │                              CID                              │
├─────────┼─────────────────────────┼──────────┼─────────┼──────┼──────────┼───────────────────────────────────────────────────────────────┤
│    0    │    'helia-fs - src'     │ '40.75'  │ '24.54' │  5   │ '94.99'  │ 'bafybeihbhrmfrwasg4ixtayjqaj3cqothpoa6e26ifwzmzkfqnornyucfy' │
│    1    │    'helia-fs - dist'    │ '12.36'  │ '80.89' │  5   │ '333.51' │ 'bafybeicmluos7lkmgcrmsxsayuxv3ulsb7n7aaifdj2mihvojglkmt6m6i' │
│    2    │ 'helia-fs - ../gc/src'  │ '41.40'  │ '24.16' │  5   │ '94.52'  │ 'bafybeihhyvzl4zqbvvtafd6cnp37gwvrypn2cxpyr2yj5zppvgk3urxgpm' │
│    3    │    'helia-mem - src'    │ '224.24' │ '4.46'  │  5   │  '6.27'  │ 'bafybeihbhrmfrwasg4ixtayjqaj3cqothpoa6e26ifwzmzkfqnornyucfy' │
│    4    │   'helia-mem - dist'    │ '69.97'  │ '14.29' │  5   │ '24.43'  │ 'bafybeicmluos7lkmgcrmsxsayuxv3ulsb7n7aaifdj2mihvojglkmt6m6i' │
│    5    │ 'helia-mem - ../gc/src' │ '291.57' │ '3.43'  │  5   │  '4.34'  │ 'bafybeihhyvzl4zqbvvtafd6cnp37gwvrypn2cxpyr2yj5zppvgk3urxgpm' │
└─────────┴─────────────────────────┴──────────┴─────────┴──────┴──────────┴───────────────────────────────────────────────────────────────┘
# > TEST_PATH=../../node_modules/neo-async npm start
┌─────────┬────────────────────────────────────────────┬─────────┬──────────┬──────┬───────────┬───────────────────────────────────────────────────────────────┐
│ (index) │               Implementation               │  ops/s  │  ms/op   │ runs │    p99    │                              CID                              │
├─────────┼────────────────────────────────────────────┼─────────┼──────────┼──────┼───────────┼───────────────────────────────────────────────────────────────┤
│    0    │ 'helia-fs - ../../node_modules/neo-async'  │ '2.12'  │ '472.23' │  5   │ '2041.79' │ 'bafybeib5nofkubfon4upbeqvtn224uajsauqlkvlrik5p4xo53ws7e24sm' │
│    1    │ 'helia-mem - ../../node_modules/neo-async' │ '18.97' │ '52.72'  │  5   │  '83.68'  │ 'bafybeib5nofkubfon4upbeqvtn224uajsauqlkvlrik5p4xo53ws7e24sm' │
└─────────┴────────────────────────────────────────────┴─────────┴──────────┴──────┴───────────┴───────────────────────────────────────────────────────────────┘
# > TEST_PATH=../../node_modules/ipfs-core npm start
┌─────────┬────────────────────────────────────────────┬────────┬───────────┬──────┬────────────┬───────────────────────────────────────────────────────────────┐
│ (index) │               Implementation               │ ops/s  │   ms/op   │ runs │    p99     │                              CID                              │
├─────────┼────────────────────────────────────────────┼────────┼───────────┼──────┼────────────┼───────────────────────────────────────────────────────────────┤
│    0    │ 'helia-fs - ../../node_modules/ipfs-core'  │ '0.13' │ '7542.67' │  5   │ '33768.70' │ 'bafybeic4duvrbtc4l5cmjnobtky5cpzkc27wjlbmqdbxdbvkpo7vlnrsue' │
│    1    │ 'helia-mem - ../../node_modules/ipfs-core' │ '1.59' │ '630.08'  │  5   │  '702.96'  │ 'bafybeic4duvrbtc4l5cmjnobtky5cpzkc27wjlbmqdbxdbvkpo7vlnrsue' │
└─────────┴────────────────────────────────────────────┴────────┴───────────┴──────┴────────────┴───────────────────────────────────────────────────────────────┘
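
To put a number on "significantly faster", here is a small sketch that computes the speedup of helia-mem over helia-fs from the mean ms/op values reported above. The `runs` array and `speedup` function are names introduced for this illustration; the figures themselves are copied from the benchmark output.

```typescript
// Mean ms/op figures copied from the three benchmark runs above.
const runs = [
  { target: "src", fsMs: 24.54, memMs: 4.46 },
  { target: "dist", fsMs: 80.89, memMs: 14.29 },
  { target: "../gc/src", fsMs: 24.16, memMs: 3.43 },
  { target: "node_modules/neo-async", fsMs: 472.23, memMs: 52.72 },
  { target: "node_modules/ipfs-core", fsMs: 7542.67, memMs: 630.08 },
];

// How many times faster the in-memory run is, based on mean ms/op.
function speedup(fsMs: number, memMs: number): number {
  return fsMs / memMs;
}

for (const run of runs) {
  console.log(`${run.target}: ${speedup(run.fsMs, run.memMs).toFixed(1)}x`);
}
```

On these figures the in-memory setup is roughly 5.5x to 12x faster, with the largest gap on the slowest target (ipfs-core). Each row is based on only 5 runs, so the ratios are indicative rather than precise.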

See the updated benchmark changes at test: create add-dir benchmark by SgtPooki · Pull Request #167 · ipfs/helia · GitHub
