What motivates people to use IPFS for large volumes of data?

Jared4DataRoads · January 8, 2018, 8:02pm

I can only speak from my own experience, but that now spans over 20 years. I will give you all the reasons I want to use IPFS in the near future for Data Roads Foundation projects, as well as some features that I would have loved when I was maintaining IT and tools in the video game industry, where our internal repositories were already in the tens of terabytes range a decade ago.

Near future applications:

- Transparent caching proxy store (e.g. a Squid backend) for Data Roads mesh edge nodes and firewalls, potentially with multiple nodes coordinating one large IPFS repository as a Ceph cluster: locally distributing an IPFS repository subset with stripes and parity or erasure coding, based on mesh cluster-local browser requests.
- CDN backend for Data Roads Internet proxy and Unwatch.Me VPN tunnel nodes, to avoid repeat sends per mesh cluster, and to send wire-compressed delta updates instead of full new files when available. This CDN will also act as a sort of reverse proxy and load balancer for serving Data Roads Co-Op customer sites.
  - The CDN and caching proxy systems above will likely coordinate with each other regionally, preferably with some predictive synchronization based on user demand patterns or PubSub registrations, and some capacity for sending wire-compressed deltas in encrypted multipath tunnels. To avoid overfilling small cluster storage volumes and quotas, these caches will also need garbage collection that is priority-bucketed and FIFO-ordered by request timestamp.
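To make the priority-bucketed, request-timestamp FIFO garbage collection idea a bit more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not anything that exists in IPFS: the class name `PriorityBucketedFifoCache`, the byte quota, and the choice to move an entry to the newest end of its bucket on a repeat request are all hypothetical.

```python
from collections import OrderedDict


class PriorityBucketedFifoCache:
    """Toy cache index: evicts the oldest-requested entries from the
    lowest-priority bucket first (hypothetical sketch, not an IPFS API)."""

    def __init__(self, quota_bytes, num_priorities=3):
        self.quota_bytes = quota_bytes
        self.used_bytes = 0
        # One request-ordered bucket per priority; bucket 0 is evicted first.
        self.buckets = [OrderedDict() for _ in range(num_priorities)]

    def request(self, key, size, priority):
        """Record a request for `key`, then garbage-collect down to quota.

        Returns the list of keys evicted by this request.
        """
        # Assumption: a repeat request refreshes the entry's position.
        for bucket in self.buckets:
            if key in bucket:
                self.used_bytes -= bucket.pop(key)
        self.buckets[priority][key] = size  # appended at the newest end
        self.used_bytes += size
        return self.collect_garbage()

    def collect_garbage(self):
        evicted = []
        for bucket in self.buckets:  # lowest priority first
            while self.used_bytes > self.quota_bytes and bucket:
                key, size = bucket.popitem(last=False)  # oldest request first
                self.used_bytes -= size
                evicted.append(key)
        return evicted

    def __contains__(self, key):
        return any(key in bucket for bucket in self.buckets)
```

In a real node the sizes would come from the block store and the priorities from something like PubSub registrations or demand predictions; both are left abstract here.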

Features that would have allowed me to replace expensive and inefficient media version control and repository systems in the video game industry, like AlienBrain and Perforce:

- Version control system for large binary files, with support for wire and storage deltas and visual diff (e.g. false-color-delta image diff), or master-write locking mechanisms where no diff applications are available (e.g. most 3D formats, with distributed multi-master-write locks). Branch cherry-picking and partial-history sync are necessary to avoid overfilling local client workstation volumes.
  - Some compiler and media transcode or lossy-compression pipelines can also be accelerated by storing recent mid-process binary artefacts, so that workstations performing local rebuilds can pick up from those artefacts as a sort of “save point” and only perform complete rebuilds from scratch on the minority of locally edited files. (This is how we sped up per-workstation build and test pipelines, long before buildbots, CI/CD, and fast interpreters/compilers like Go were a thing.)
- Console target executable and data distribution, with output version tagging and multi-version binary deduplication. During the early PSP and late PS2 era, we sometimes had source art and code that compiled down to different binaries and degrees of compression lossiness depending on the target console, in the same way you might produce different build outputs today when targeting desktop and mobile devices natively (without a JIT) from the same codebase.
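As a rough illustration of the version tagging and multi-version binary deduplication mentioned above, here is a small content-addressed store sketched in Python. Everything in it is an assumption for illustration: the `ContentStore` class, the `(tag, target)` manifest keys, and the example file names are hypothetical, and a plain SHA-256 hex digest stands in for a real IPFS content identifier.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store: identical build outputs are kept once,
    no matter how many versions or console targets reference them."""

    def __init__(self):
        self.blobs = {}      # digest -> bytes, each unique blob stored once
        self.manifests = {}  # (version tag, target) -> {path: digest}

    def put(self, data):
        # SHA-256 hex digest as a stand-in for a real content identifier.
        digest = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(digest, data)
        return digest

    def tag_build(self, tag, target, files):
        """Record the outputs of one build under a version tag and target."""
        self.manifests[(tag, target)] = {
            path: self.put(data) for path, data in files.items()
        }

    def checkout(self, tag, target):
        """Rebuild the path -> bytes mapping for a tagged build."""
        manifest = self.manifests[(tag, target)]
        return {path: self.blobs[digest] for path, digest in manifest.items()}

    def stored_bytes(self):
        return sum(len(blob) for blob in self.blobs.values())


store = ContentStore()
store.tag_build("v1.0", "ps2", {"game.elf": b"ps2-code", "music.bin": b"shared-audio"})
store.tag_build("v1.0", "psp", {"game.elf": b"psp-code", "music.bin": b"shared-audio"})
print(len(store.blobs), "unique blobs for 4 tagged files")
```

The same keying idea would apply to the mid-process “save point” artefacts: if an intermediate output is addressed by its content, a workstation can fetch it instead of rebuilding it. Whole-file hashing is the simplest case; chunk-level deduplication would go further for large binaries that differ only slightly between targets.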

Related to the transparent caching proxy and CDN projects above, I would also like to partner with Protocol Labs and the Internet Archive to produce a series of pre-filled hard drives, in “top sites” and “top 3 recent versions” editions, for B1G1-style public sale. Each TB+ drive would contain an IPFS repository subset of the Internet Archive and some cross-platform installation tools for IPFS, so that anyone receiving a drive could quickly have LAN access to a drive's worth of compressed, recent, popular Internet history. We have already worked with Internet-In-A-Box.org on a similar project, but IPFS and LibP2P would open the way toward updating these local Internet caches for community meshes worldwide on a regular basis, using whatever means are available, including DTN-bundled delta updates over slow satellite connections and periodic physical drive deliveries.

Please let me know if I should submit any of this as IPFS or LibP2P project proposals on GitHub or elsewhere. I want to contribute to building out these products, and will actively recruit Data Roads dev volunteers to help.

Notes on the ancient history of binary version control and distribution systems:

Back in the days before Git and Mercurial had reached v1.0, and before Subversion was an Apache project, I solved some of these problems using SVK, a little-known Perl toolset that wrangled distributed Subversion repositories and managed their synchronization so that they resembled a distributed version control system (DVCS). This SVK-SVN-sync solution allowed me to use all the binary Xdelta compression, master repo HEAD file locking, and cherry-picking capabilities of Subversion, while also distributing full or partial repositories among workstations better than Mercurial could. However, both SVN and Hg lacked the “obliterate” commands and history depth-limiting configuration options of Perforce, which turned minor disk-hogging mistakes into major headaches, so we were often forced to use expensive Perforce licenses for most large 3D and art file version tracking. Sadly, Audrey Tang seemed to have dropped development of SVK before I left High Impact Games, and my bosses in the games industry never gave me permission to contribute any of our bespoke solutions back to Open Source communities, as I had desired.
