I’m interested in using IPFS as a means of distributing and archiving scientific datasets. Toward that end, for the past ~3 years (starting in late 2020) I’ve been running a project where a Raspberry Pi 4 hosts a ~30GB image dataset. I want to share my experience as a real-world use case for IPFS and document what has been easy / hard. My hope is that this use case can help identify and motivate improvements for kubo, or perhaps help me improve my setup.
Use Case
I’m working on a project called shitspotter, where I’m training a neural network to detect dog poop in images. This requires access to a labeled image dataset, which I have been building. Details about the project are here: GitHub - Erotemic/shitspotter: An open source algorithm and dataset for finding poop in pictures. A work in progress.
In terms of interacting with IPFS, I have a root folder that contains an assets directory. Every month I make a new folder corresponding to that date and copy all of the new images I’ve taken into it. I then run
ipfs add --pin -r <root> --progress --cid-version=1 --raw-leaves=false
on the root path. This re-adds the whole directory; blocks that are already in the local datastore are deduplicated, so only the new content actually gets stored. The command gives me a new CID for the updated dataset, and this is what I publish on the GitHub README. I’ve found this to be nice, because old folders keep their old CIDs, so if people pin the dataset at any point, they help host at least some of the content, even as new data is added. It is important to note that I originally uploaded the dataset without --cid-version=1, and that is why I’m setting --raw-leaves=false: CIDv1 uses raw leaves by default, and disabling them keeps the leaf blocks identical to those of the original upload, so anyone who pinned the data when I first released it is still helping to host that data as I continue to add new data.
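For concreteness, a sketch of the monthly update (the dated folder name here is illustrative, not my exact layout):

# copy this month’s photos into a new dated folder under assets/
mkdir <root>/assets/poop-2023-12-19
cp /path/to/new/photos/*.jpg <root>/assets/poop-2023-12-19/

# re-add the whole tree; blocks already in the local datastore are
# deduplicated, so only the new folder’s content is actually stored
ipfs add --pin -r <root> --progress --cid-version=1 --raw-leaves=false

# the final line of output is the new root CID, which goes in the README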
What’s been easy
Working with IPFS on a LAN has been great. I pin the data on my main machine (which is not exposed to the WAN), and then I run a pin command on my Raspberry Pi (which is connected to the WAN), and that very quickly transfers all the data to the public-facing IPFS server.
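Concretely, the transfer step is just a pin by CID on the Pi; bitswap finds my LAN node and pulls the blocks over the local network:

# on the Raspberry Pi (WAN-facing), after adding on the main machine
ipfs pin add --progress bafybeie275n5f4f64vodekmodnktbnigsvbxktffvy2xxkcfsqxlie4hrm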
My employer also lets me use their IPFS server to re-pin the data. This tends to be a bit slower than working on the LAN, but it’s reasonable, and it guarantees that there are at least 2 nodes pinning the entire dataset.
At one point I was using web3.storage to have a 3rd host of the data, but they no longer offer a free tier, so I stopped doing that. Still, the --service option is very nice.
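For reference, the remote-pinning flow looked roughly like this (the endpoint and token below are placeholders, not my actual credentials):

# register the pinning service once
ipfs pin remote service add web3storage https://api.web3.storage/ <API_TOKEN>

# ask the service to pin the current dataset root
ipfs pin remote add --service=web3storage --name=shitspotter <root-cid>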
What’s been hard
While I’ve been able to access the data very quickly, that hasn’t been the case for other people. I recently got an email from a person interested in using the data, and they attempted to run:
ipfs ls bafybeie275n5f4f64vodekmodnktbnigsvbxktffvy2xxkcfsqxlie4hrm
The top-level content is a few folders and several smaller ~100KB files:
bafybeifqbkqxif73ewelbnr4cqfhpljl2yz2rfksb2y7dvyhokiddsd5qy - _cache/
bafybeic5a4kjrb37tdmc6pzlpcxe2x6hc4kggemnqm2mcdu4tmrzvir6vm - analysis/
bafybeicirkqvz6pedd3mvpokyo7cwy2x3isxxnjiplzdgsi22qxs2wv6ie - assets/
bafybeiesjhwbueg7nuyy4ga2dfxo7bjvkk3hhiesukqygogcmbfqgqg2ee 3119681 data.kwcoco.json
bafybeig63rot73r22hwnzw2dofqzvmz5ubwdnb5dn5xr5ylmwv4uh2ca3y 79 train.kwcoco.zip
bafybeifagh5dvowtjlepejnppbjcyfxipt55c36s4i6s7ljbcqkw5euuqm 66229 train_imgs278_27bcbd3e.kwcoco.zip
bafybeifcch6zn4y6a73ougeltsfetlhmzwjxkuhh2h2zk4iqkuo4ihswre 84417 train_imgs346_11d67089.kwcoco.zip
bafybeicwt6iy5crccdpybk3io7teb2innbmt5u3rqqtkkg6347bdc3ck2y 84528 train_imgs346_3e3fc072.kwcoco.zip
bafybeibvesr2rxfjaj6y6mfwyslq5wfifyp6rfrerybtplljwseccf5p5u 91111 train_imgs386_4653bb8f.kwcoco.zip
bafybeifaidhinenbykuei6ptxeedn4b3ypxwp57wyjb6cv4iu5ssdiujeu 105749 train_imgs454_ff3d0b9d.kwcoco.zip
bafybeibubxgytkp67yo5jczgnmbmxvy3sbtajulqqofdpkhoqckeuygd6q 146694 train_imgs647_576c8a63.kwcoco.zip
bafybeihftuyiorcbfhb5bjvcniu4dw5q4gckoavsrhtm4nbzegml4rzmim 148486 train_imgs647_65ea74f6.kwcoco.zip
bafybeihk7t4jt6pjvlsqd5glb3l5xb5wtrmbvi4nof4u6kw2z2jitteliq 178546 train_imgs760_19315e7e.kwcoco.zip
bafybeifgvlqkp4npr445n5l7yvw2wflmaaiwpmpv7toappku3gl5xoa5me 78 vali.kwcoco.zip
bafybeig7eu5zb54d7cbacd4ydjn2omezkwnczt2gub2rsyxowlaze4g5ui 45578 vali_imgs159_248a33db.kwcoco.zip
bafybeigajvmrj4cpbuumtdrzbzj27hfyqqbnren7sjmrumjcttkisby4oi 45068 vali_imgs159_ed881576.kwcoco.zip
bafybeib7l4mrlqgg6xlkz3n6u4pj37xlkftgqrdu267ifedei24yuz7jgm 28237 vali_imgs84_078e0ebf.kwcoco.zip
bafybeih2onl4ql73cmvqkvpkw6or4bw562hlrgzmztkvnqahf5ioo3xwtm 28432 vali_imgs84_8b1bbddd.kwcoco.zip
bafybeidx2xae2hgdg5djzz4kox2vizytqtwgqogj7r37omwasxdlkvblv4 28277 vali_imgs84_f4c3d117.kwcoco.zip
However, when the external user ran the ls command, it hung for over half an hour with no response before they killed it. That is a big UX problem.
I have verified that they were able to access a single image from the original dataset (which I believe is pinned by more than just me) via:
ipfs get bafybeigueauk5udaoeq3cjedqz4usm4xxcpk4acv5z5rml54sksaiqnd7i -o IMG_20201112_112429442.jpg
But I have not verified whether they can access any of the newer data yet. I’m going to have them try this command to test that:
ipfs get bafybeifdxozsmyvks3pshj2qujhpbwwkaeu7cd6vvuwphyd5zunjdtezbi -o PXL_20231116_135031922.jpg
I’m likely the only person pinning that data at the moment.
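As another data point, I may also have them try fetching the same CID through a public gateway, which tests retrievability without running a local node at all:

# fetch through the ipfs.io public gateway (any public gateway should work)
curl -L -o PXL_20231116_135031922.jpg https://ipfs.io/ipfs/bafybeifdxozsmyvks3pshj2qujhpbwwkaeu7cd6vvuwphyd5zunjdtezbi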
I’m not sure what could cause the ls to hang for 30 minutes like that. Is it just trying to find my node and failing? Is there anything I could do to make external access to the data easier?
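Some checks I plan to run on my end (syntax per recent kubo, where ipfs routing replaced the older ipfs dht commands; the CID is the dataset root from the ls above):

# does the DHT have provider records for the root CID?
ipfs routing findprovs bafybeie275n5f4f64vodekmodnktbnigsvbxktffvy2xxkcfsqxlie4hrm

# is the WAN-facing node advertising publicly dialable addresses?
ipfs id

# bound the ls so it fails fast instead of hanging indefinitely
ipfs ls --timeout=60s bafybeie275n5f4f64vodekmodnktbnigsvbxktffvy2xxkcfsqxlie4hrm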