Reasons why IPFS is a powerful tool for Machine Learning

I have compiled 5 compelling reasons why a Machine Learning team, and especially an MLOps team, should adopt IPFS in its technology stack:


I haven’t even read the post yet but I’m going to reply with +1 just based on the title. Yes, yes, yes IPFS+AI/ML

If you don’t mind I’m going to add a couple to your list.

  • Reproducibility. How do you know what images a model was trained on? Usually it’s buried in a README somewhere: “This model was trained on MS-COCO, ImageNet, etc.” But how do you know what exactly? Were one or two images left out? Were they preprocessed so they were sort of like ImageNet? Did you do some image augmentation that applied translations, rotations, etc.? With IPFS you can refer to your dataset unambiguously with a single hash, allowing someone else to download the exact images you used.

  • Deduplication. You shouldn’t need to copy your entire dataset just to add or remove an image or two. With IPFS, creating a new dataset out of existing images is almost free, allowing you to generate as many test or validation datasets as you’d like without copying data.

  • Distribution of models. Many pretrained models are made available through opaque data-loading classes where you don’t even know where the files come from. They’re convenient but wasteful, and prone to break if there’s ever a problem with the service. The models can be large, and downloading them multiple times on the same machine wastes bandwidth and disk space. It’s especially silly if the service is down but the file you need is sitting on the machine right next to you. With IPFS you can easily support offline operation while still being able to get the model from any machine that can provide it.

  • Tracking models. Just like tracking the data, tracking the models is difficult. Which model are you running? Where did you get it from? How do you know it’s the right one? If models are distributed via IPFS, you know exactly what you’re running.

  • Easily share results with colleagues. Want to share that custom dataset? No problem, just send the IPFS hash. Want to share your training logs? No problem, publish them on IPFS. Want to share the latest model snapshot to have someone check it out? No problem, publish that on IPFS.

  • Jupyter notebooks on IPFS. Again, more sharing goodness. And image annotation should definitely be powered by an IPFS-backed web app. Anyone should be able to annotate any image.
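To make the reproducibility and deduplication points a bit more concrete, here is a rough Python sketch of content addressing. It is not IPFS itself: plain SHA-256 hex digests stand in for real CIDs, and `BlobStore`, `make_dataset` and `dataset_id` are made-up names for illustration only.

```python
import hashlib


def blob_id(data: bytes) -> str:
    """Content address of one file: hex SHA-256 of its bytes (a stand-in for a CID)."""
    return hashlib.sha256(data).hexdigest()


class BlobStore:
    """Toy content-addressed store: identical bytes are only ever stored once."""

    def __init__(self):
        self.blobs = {}

    def put(self, data: bytes) -> str:
        key = blob_id(data)
        self.blobs.setdefault(key, data)  # no-op if these bytes are already stored
        return key


def make_dataset(store: BlobStore, files: dict) -> dict:
    """Build a manifest mapping file name -> content address."""
    return {name: store.put(data) for name, data in files.items()}


def dataset_id(manifest: dict) -> str:
    """Reduce a whole manifest to a single hash that names the dataset."""
    h = hashlib.sha256()
    for name in sorted(manifest):  # sort so the id does not depend on insertion order
        h.update(f"{name}\0{manifest[name]}\n".encode())
    return h.hexdigest()


store = BlobStore()
images = {"cat.jpg": b"cat-bytes", "dog.jpg": b"dog-bytes", "owl.jpg": b"owl-bytes"}

train = make_dataset(store, images)
# A validation subset is just a smaller manifest; no image bytes are copied.
val = {name: key for name, key in train.items() if name != "owl.jpg"}

print(dataset_id(train))
print(dataset_id(val))
print(len(store.blobs))  # number of distinct blobs actually stored
```

Dropping a single image changes the dataset id, so two people holding the same id are holding the same images, and the subset costs only a new manifest rather than a second copy of the data.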

Hey, thanks. Awesome list. Let me take some time to rework and update my article.

Thanks for the great vector database. I remember checking it out a couple of years ago and it looks like you have made steady progress. I’ll have to swing back around and see what’s new.

I’ve been really surprised that there seems to be so little interest in IPFS from the AI world. The closest thing seems to be Academic Torrents, but that’s not even close to what you can do with IPFS. There are a ton of GitHub projects that say “pretrained models can be found here” and give a gdrive, dropbox or whatever address that doesn’t work anymore.

I’d love to see something like Holium used for defining image processing pipelines and using IPFS to cache the intermediate results. Also maybe have AquilaDB listen to a pubsub channel?
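Roughly what I have in mind for caching intermediate results, as a Python sketch. This is not Holium’s actual API and it doesn’t talk to IPFS: two in-memory dicts stand in for the content-addressed store, and `run_step`, `normalize` and `flip` are hypothetical names.

```python
import hashlib

blobs = {}   # content id -> bytes; stands in for a content-addressed store
index = {}   # (step name, input id) -> output id
calls = []   # records which steps actually executed


def cid(data: bytes) -> str:
    """Hex SHA-256 of the bytes, used here in place of a real CID."""
    return hashlib.sha256(data).hexdigest()


def add_input(data: bytes) -> str:
    key = cid(data)
    blobs[key] = data
    return key


def run_step(name, fn, input_id: str) -> str:
    """Run fn on the input blob unless this (step, input) pair is already cached."""
    key = (name, input_id)
    if key in index:
        return index[key]          # cache hit: reuse the stored result
    calls.append(name)
    out = fn(blobs[input_id])
    out_id = cid(out)
    blobs[out_id] = out
    index[key] = out_id
    return out_id


def normalize(data: bytes) -> bytes:   # placeholder for a real preprocessing step
    return data.lower()


def flip(data: bytes) -> bytes:        # placeholder for a real augmentation step
    return data[::-1]


def pipeline(input_id: str) -> str:
    return run_step("flip", flip, run_step("normalize", normalize, input_id))


src = add_input(b"RAW-IMAGE")
first = pipeline(src)
second = pipeline(src)   # served entirely from the cache

print(calls)
print(first == second)
```

Because every intermediate result is keyed by the step name plus the hash of its input, rerunning the pipeline on the same data does no work, and anyone who shares the store could reuse the results too.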


Hi! I’m David Aronchick and I’m co-director of Research Development at Protocol Labs. If you’re interested in any of the above (and more!), we’d love to have you in a project SPECIFICALLY designed to address IPFS+AI/ML :) GitHub - filecoin-project/bacalhau

Come check it out!