We're doing a round of interviews to understand the needs and priorities of people who have large volumes of data (dozens of terabytes to petabytes) to put on IPFS. One of the things I want to do in those interviews is test our assumptions about the factors that make IPFS more or less appealing for people who are dealing with data on that scale. I have initial guesses but would love to hear people's ideas about other factors we should consider.
My initial guesses at the distinguishing features/functionality that people find compelling:
Content Addressing
Ability to move, replicate and re-provide data without compromising the integrity of the data
Ability to sub-select a portion of a dataset without replicating (or citing) the whole thing
Provides a basis for versioning (content-addressed) but is agnostic about the versioning metadata structure
Supports efficient on-the-fly aggregation and analysis of data from multiple locations (see poster about Hadoop on IPFS)
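To make the first two points concrete, here is a minimal sketch of why content addressing lets data be moved or replicated without trusting the copy. It is not how IPFS itself computes identifiers: a plain SHA-256 hex digest stands in for a real content identifier, and `BlockStore` and `replicate` are hypothetical names invented for this illustration.

```python
import hashlib


def address(data: bytes) -> str:
    """Return a SHA-256 hex digest, used here as a stand-in content address."""
    return hashlib.sha256(data).hexdigest()


class BlockStore:
    """Hypothetical in-memory store that keys every block by its own hash."""

    def __init__(self):
        self._blocks = {}

    def put(self, data: bytes) -> str:
        key = address(data)
        self._blocks[key] = data
        return key

    def get(self, key: str) -> bytes:
        data = self._blocks[key]
        # Integrity check: the bytes must still hash to the address requested.
        if address(data) != key:
            raise ValueError("block does not match its address")
        return data


def replicate(key: str, source: BlockStore, target: BlockStore) -> str:
    """Copy one block between stores; the address is the same on both sides."""
    return target.put(source.get(key))
```

Because the address is derived from the bytes, any node can re-provide a block and any reader can check it, and a single block can be fetched by its address without pulling the rest of the dataset.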
My guesses at the factors people will consider when evaluating IPFS:
I can only speak to my own experience, but that is over 20 years long now. I will give you all the reasons I want to use IPFS in the near future for Data Roads Foundation projects, as well as some features that I would have loved when I was maintaining IT and tools in the video game industry, where our internal repositories were already in the tens of terabytes range a decade ago.
Near future applications:
Transparent caching proxy store (e.g. Squid backend) for Data Roads mesh edge-nodes/firewalls, potentially with multiple nodes coordinating one large IPFS repository as a Ceph cluster: locally distributing an IPFS repository subset with stripes and parity or erasure coding, based on mesh cluster-local browser requests.
CDN backend for Data Roads Internet proxy and Unwatch.Me VPN tunnel nodes, to avoid repeat sends per mesh cluster, and to send wire compressed delta updates instead of full new files when available. This CDN will also act as a sort of reverse-proxy and load balancer for serving Data Roads Co-Op customer sites.
These CDN and caching proxy systems above will likely coordinate with each other regionally, preferably with some predictive synchronization based on user demand patterns or PubSub registrations, and some capacity for sending wire compressed deltas in encrypted multipath tunnels. To avoid overfilling small cluster storage volumes and quotas, they will also need to be priority-bucketed request timestamp FIFO Garbage Collection caches.
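As a rough illustration of what a priority-bucketed, request-ordered FIFO garbage collection cache could look like, here is a small sketch. Everything in it is an assumption made for the example: the class name `BucketedFifoCache`, the rule that bucket 0 is evicted first, and the use of insertion order as a stand-in for request timestamps. It is not an existing IPFS or Squid feature.

```python
from collections import OrderedDict


class BucketedFifoCache:
    """Hypothetical cache: evicts the lowest-priority bucket first, oldest request first."""

    def __init__(self, capacity_bytes, num_buckets=3):
        self.capacity = capacity_bytes
        self.used = 0
        # One ordered map per priority bucket; bucket 0 is collected first.
        self.buckets = [OrderedDict() for _ in range(num_buckets)]

    def __contains__(self, key):
        return any(key in bucket for bucket in self.buckets)

    def request(self, key, size, priority):
        """Record a request for `key` and return the keys evicted to make room."""
        # A repeated request moves the entry to the back of its queue.
        for bucket in self.buckets:
            if key in bucket:
                self.used -= bucket.pop(key)
        self.buckets[priority][key] = size
        self.used += size
        evicted = []
        while self.used > self.capacity:
            victim = self._evict_one(protect=key)
            if victim is None:
                break
            evicted.append(victim)
        return evicted

    def _evict_one(self, protect):
        # Walk buckets from lowest priority up; take the oldest entry found.
        for bucket in self.buckets:
            victim = next((k for k in bucket if k != protect), None)
            if victim is not None:
                self.used -= bucket.pop(victim)
                return victim
        return None
```

A real deployment would track wall-clock request times and per-cluster quotas; the ordered maps here only show the eviction order.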
Features that would have allowed me to replace expensive and inefficient media version control and repository systems in the video game industry, like AlienBrain and Perforce:
Version control system for large binary files, with support for wire and storage deltas and visual diff (eg. false-color-delta image diff), or master-write locking mechanisms where no diff applications are available (eg. most 3D formats, with distributed multi-master-write locks). Branch cherry picking and partial-history sync are necessary to avoid overfilling local client workstation volumes.
Some compiler and media transcode or lossy-compression pipelines can also be accelerated by storing recent mid-process binary artefacts, wherein workstations performing local rebuilds can pick up from the mid-process artefacts as a sort of "save point", and only perform complete rebuilds from scratch on a minority of locally-edited files. (This is how we sped up per-workstation build and test pipelines, way before buildbots, CI/CD, and fast interpreters/compilers like Go were a thing).
Console target executable and data distribution, with output version tagging and multi-version binary deduplication. During the early PSP and late PS2 era, we sometimes had source art and code that compiled down to different binaries and degrees of lossiness in compression dependent on the target console, in the same way you might produce different build process outputs when targeting desktop and mobile devices from the same codebase today, natively sans JIT.
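The "save point" idea for mid-process artefacts maps naturally onto content addressing. The following is a minimal sketch under stated assumptions, not a description of the pipelines we actually ran: `ArtifactCache` and `run_pipeline` are hypothetical names, and each stage is assumed to be a pure function from bytes to bytes.

```python
import hashlib


class ArtifactCache:
    """Hypothetical cache of intermediate build outputs, keyed by stage name and input hash."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def run_stage(self, stage_name, input_bytes, fn):
        # The key covers both the stage and its exact input, so an edited
        # file misses the cache while untouched files reuse earlier output.
        key = hashlib.sha256(stage_name.encode() + b"\0" + input_bytes).hexdigest()
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        output = fn(input_bytes)
        self._store[key] = output
        return output


def run_pipeline(cache, stages, source):
    """Run `source` through a list of (name, function) stages, reusing cached artefacts."""
    data = source
    for name, fn in stages:
        data = cache.run_stage(name, data, fn)
    return data
```

If two different sources produce the same intermediate bytes, every later stage is served from the cache, which is the "pick up from the save point" behaviour described above. Per-console targets could be handled by including the target name in the stage name.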
Related to the transparent caching proxy and CDN projects above, I would also like to partner with Protocol Labs and the Internet Archive to produce a series of "top sites" and "top 3 recent versions" of pre-filled hard drives for B1G1 style public sale. Each TB+ drive would contain an IPFS repository subset of the Internet Archive and some cross-platform installation tools for IPFS, so that anyone receiving a drive could quickly have LAN access to the hard drive's capacity of compressed recent popular Internet history. We have already worked with Internet-In-A-Box.org on a similar project, but IPFS and LibP2P would open the way toward updating these local Internet caches for community meshes worldwide on a regular basis, using whatever means available, including DTN bundled delta updates over slow satellite connections, and periodic physical drive deliveries.
Please let me know if I should input any of this as IPFS or LibP2P project proposals on GitHub or elsewhere. I want to contribute to building out these products, and will actively recruit Data Roads dev volunteers to help.
Notes on the ancient history of binary version control and distribution systems:
Back in the days before Git and Mercurial were v1.0, and Subversion was not yet an Apache project, I solved some of these problems using SVK, a little-known Perl toolset that wrangled distributed Subversion repositories and managed their synchronization, to resemble a distributed version control system (DVCS). This SVK-SVN-sync solution allowed me to use all the binary Xdelta compression, master repo HEAD file locking, and cherry picking capabilities of Subversion; while I could also distribute full or partial repositories among workstations better than Mercurial. Both SVN and Hg however lacked the "obliterate" commands and history depth-limiting configuration options of Perforce, which made minor disk-hogging mistakes major headaches, so we were often forced to use expensive Perforce licenses for most large 3D and art file version tracking. Sadly, Audrey Tang seemed to have dropped development of SVK before I left High Impact Games, and my bosses in the games industry never gave me permission to contribute back any of our bespoke solutions to Open Source communities, as I had desired.
I'm new to IPFS, so hopefully my comments aren't outside the lane you intended.
I'd add simplicity and integration to your list of factors. Organizations will have some sort of storage virtualization technology and they'll want IPFS to plug into that. They'll also have content and data management systems (beyond Hadoop) that theoretically shouldn't need to know about IPFS, but that probably needs to be confirmed. For instance, I'm digging into video (specifically all the bodycam/dashboard cam data being generated). It's not saved for very long today since it is too expensive. Governments aren't going to want to hire extra people to use IPFS if they can't use an existing interface.
I would very much like to start using IPFS but I am nervous of doing so as I am on a metered WiFi connection in the middle of the Southern Highlands of Scotland, where the phone company buried shit aluminium phone cable 50 years ago and I am so far from the switch that the impedance kills any chance of usable bandwidth. But it's not all bad, at least I get 28Mbps down and 40Mbps up with my 4G connection. It's just so expensive and of course it's not built for volume.
I just went through a long and productive meeting with Jeremy from the Go team, going over some of the issues that hold up making large amounts of data available from archive.org's perspective. I don't know if he went over all the points, but I'll outline them here, and we can cover them on Friday if your team will be around long enough and there aren't too many other items on the agenda.
In brief, in our test architecture we are putting the data in on the server. We'd like to do it directly, but currently it goes through a local IPFS (Go) instance. We are then trying to access it in the browser.
We are assuming that many of these issues will eventually get solved; until then it's hard to add large amounts (petabytes) of data to IPFS.
Architecture
Content addressing requires a whole new data structure (IPLD); it's not true content addressing of the actual content, and we can't cross-check IPLDs because they don't contain content addressing.
Ongoing issues with links not being self-describing, and guessing wrong about what type of link you have causing crashes
IPNS is unusable. I think you are aware of the issues, so I won't repeat them in this topic.
And of course the lack of a Python implementation means that we can't really build anything on the server side, or fix any of the issues there.
Poor support for mutability
Poor cross-platform support, e.g. lack of mime-types on content, can't use URLs as external links etc.
Server side:
No external IPLD builder in Python, so we can't automate the add process without the slow process of going through a local IPFS instance.
Duplication - all content occupies double disk space even before anyone accesses it (should be fixed if/when the url store extension is completed)
Content added on server is not available on Browser without machinations to get around bugs/limitations.
Browser/Javascript issues:
Lack of browser/go server compatibility, e.g. the bug where content added on the go server is not available in the browser without pinging the ipfs.io gateway
Centralized points of failure: with WebRTC not working in the browser, websocket-star has a single point of failure.
Reliability: we had a lot of issues here, not helped by the fact that it's hard to spot errors among the stream of messages in console.log that aren't really errors.
Crashes in background threads that aren't seen in the foreground.
Persistence: anything added by a browser disappears as soon as the browser closes (I understand the clustering project might fix this when ready)
You don't have code complexity in your list. I would guess that the potential of writing only one software stack, instead of one for the client and another for the server, would be an advantage.
Thank you to everyone who has posted so far in this thread. The discussion is bringing up even more good points than I expected. I hope I can funnel it all into the next round of UX planning and implementation. As conversation here slows down I'm gathering the info and I'll try to post a digested rundown of observations people have offered.
It looks like @mitra's post might have squashed the energy of contribution and dialogue that flared up in this thread. @mitra, I'm trying to hear people's insights about motivations for putting large volumes of data on IPFS. I'm seeking feedback that's speculative and forward-looking. We will use this info to ensure that we prioritize efforts around features, documentation, performance, etc. correctly. By contrast, your post is more like a report of the things you have found frustrating with the current implementation of IPFS. While it's fine to discuss those things on the IPFS forums and the feedback is relevant for our efforts to support large volumes on IPFS, it doesn't fit in this particular thread. You've put me in an odd position because many of the conclusions you offer are either inaccurate, misleading in the way they're worded, or downright wrong. I want to address that misinformation but that would derail the focus of this thread, which is producing extremely useful and informative discussion.
Reading between the lines of @mitra's post, I see some motivations and important features to note:
Interoperable, self-describing content addressed identifiers are very important
Great client libraries in languages like Python or Rust, or possibly complete protocol implementations in those languages, are valuable
Mutability is important: we need reliable, performant ways to propagate updates to a dataset and query the "current" version of a dataset (i.e. ipfs-pubsub or IPNS)
Explicit structures for tracking metadata like mime types, etc., and integrating that info back into the headers exposed by http gateways
Ways to avoid duplication, especially by registering existing data in-place on the filesystem, as with ipfs-pack or the url store feature we're building with archive.org
Interoperability between go-ipfs and js-ipfs is a MUST: we need to be able to post data on a go-ipfs node and then consume it from a web browser using js-ipfs (and vice versa)
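To illustrate the mutability point in the list above, here is a minimal sketch of a mutable name that points at immutable content addresses and only moves forward. It is a toy model under my own assumptions, not the IPNS record format or the pubsub API: `NameRegistry`, `publish`, and `resolve` are hypothetical names, and signatures and expiry are left out entirely.

```python
class NameRegistry:
    """Hypothetical registry mapping a stable name to the latest content address."""

    def __init__(self):
        self._records = {}  # name -> (sequence number, content address)

    def publish(self, name, seq, content_address):
        """Accept an update only if it is newer than the record already held."""
        current = self._records.get(name)
        if current is not None and seq <= current[0]:
            return False  # stale or replayed update, ignore it
        self._records[name] = (seq, content_address)
        return True

    def resolve(self, name):
        """Return the content address of the "current" version of the dataset."""
        return self._records[name][1]
```

The dataset versions themselves stay immutable and content-addressed; only the small name record changes, which is what would need to propagate reliably between nodes.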
@ChristianKl one way I tend to think of this is that it allows us to switch to thinking of everything as nodes, services and workers in a broad system, where the location is incidental and changeable based on needs; for example it allows you to blur the distinction between server-side and client-side analysis. Instead of forcing a dichotomy of server-side vs client-side, it lets you think in terms of performing analysis on a device that's close to the data, on a device that's further away, or to replicate the data to a new location and analyze it there. In a way this simplifies your code base because it lets you write little libraries and services that can be reused in client applications, workers, etc. regardless of where they're run.
Do you think that is a good way to talk about the point you're making about code complexity, or does it confuse things?
Between the http gateways, which give you backwards-compatibility with www-based apps, and the emphasis on making the command line interfaces conform to unix and posix conventions, I tend to think that we have a high level of support for this kind of interoperability. Can you think of other ways people would want to integrate with something like IPFS, which operates at the data persistence layer?
Can you give an example of their alternatives? What kind of existing interfaces would they already have familiarity with?
All good points Matt, and I certainly don't want to shut down dialogue or derail the thread, but you did ask about "the factors that make IPFS more or less appealing for people who are dealing with data on that scale", and these are all factors that have slowed down our attempts to put data and the apps that use it onto IPFS, something we'd like very much to do.
I don't want to derail the thread, so let's take detailed discussion offline. Note that I already emailed you to try to get some technical time on Friday prior to our broader meeting but didn't hear back. I'll resend, and if that time isn't available, then I'd love to know (email off-thread is fine) anything I've got "downright wrong" above. One challenge has been that it's been hard to get technical engagement to address these issues.
Between the http gateways, which give you backwards-compatibility with www-based apps, and the emphasis on making the command line interfaces conform to unix and posix conventions, I tend to think that we have a high level of support for this kind of interoperability. Can you think of other ways people would want to integrate with something like IPFS, which operates at the data persistence layer?
Not to step on tkklein's good point, but one quick answer I have for this new question is: VCS client interfaces, similar to (or directly compatible with) tailor. WebDAV is a related and even more widely used interface, but I already see that on the gateway Issues list. I would think IPNS-FUSE already takes care of a lot of other potential *nix backend integrations.
@flyingzumwalt : I think that's roughly what I mean and I would expect that as IPFS matures that will provide significant value for organizations. When it comes to the question about how to best speak about the point, I don't know what the best way happens to be.
Most companies store such large workloads on EMC Isilon or NetApp, both of which have limitations on the four factors you listed above as reasons to use IPFS. I work on the sales side in storage but can say that almost all of my customers are looking to dump large archive workloads to AWS or Azure; this is always the low-hanging fruit. So archive use cases could be an interesting play, especially in industries that generate PBs of data, like media or research.
I've just recently found out about IPFS and to me it seems like it can potentially be really positive for science reproducibility.
In my particular research community, large (up to around 10TB) binary files are generated through very time-consuming simulations. Storing them appropriately is a big deal (losing files means having to repeat simulations that can span several months). Sharing them with colleagues is of course also really important and is something that is not always doable in practice, unfortunately. For example, I can't download simulation datasets of several terabytes that are hosted at Stanford's repository, since I am based in Europe, and it would take me an absurdly long time to do so.
From what I've gathered in my short time reading about IPFS, the whole point is to increase file sharing speed through talking to your nearest neighbour in the network, and not necessarily a central repository. But I've also read that duplication is avoided, and that each node in the network stores only content it is "interested" in. Therefore, in the case that I mentioned before, how would IPFS decide who stores these large datasets? Wouldn't it be too costly to have them duplicated? If so, we would be back at the situation that I am in now: downloading a huge dataset from across the globe is infeasible.
I'm interested in reading comments on this from more knowledgeable members of the IPFS community.
Hi, I work at a web user behavior analysis company, comparable to Google Analytics. Our tracking code generates several TBs of data every day. We store it in AWS S3 with an expiration set, which limits the total volume to hundreds of terabytes. We are looking for ways to reduce duplication in the stored data so that we can save money.
There are millions of sessions per day, which means we will have millions of IPFS nodes (short-lived, from seconds to tens of minutes) across the web once we deploy js-ipfs in the tracking code. I believe that may release the most potential of IPFS.
OK, back to the point. Basically, we are watching and recording all the DOM changes that happen on the page while users are visiting the site, so that we can restore the session in the future for analysis. Currently, we need the following things:
The version control, or the Tree Object mentioned in section 3.6.3 of the IPFS white paper. Right now we use a diff algorithm to calculate the DOM changes, and store both the original and the diffs in files. I believe that if the IPFS Tree Object is guaranteed, we would remove many duplicates and save a lot of space.
A reliable push (or upload) method. I've tried PubSub for a demo, and it seems that receipt of the content is not guaranteed yet. Since the tab can be closed at any time, it's very important for us to push the data to the backend within microseconds. (Well, there may be some workarounds.)
(I'll add more as I come up with it.)
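To show why a hash-linked tree could cut the duplication between DOM snapshots, here is a small sketch. It is only an assumption-laden illustration, not the Tree Object format from the white paper: nodes are plain Python tuples of `(tag, children)` or text strings, `put_tree` is a hypothetical helper, and a dict stands in for the block store.

```python
import hashlib
import json


def put_tree(node, store):
    """Store a DOM-like node and its subtree; return the hash that addresses it."""
    if isinstance(node, str):
        payload = json.dumps({"text": node})
    else:
        tag, children = node
        # A parent holds only the hashes of its children, so an unchanged
        # subtree hashes to the same key and is stored exactly once.
        links = [put_tree(child, store) for child in children]
        payload = json.dumps({"tag": tag, "links": links})
    key = hashlib.sha256(payload.encode()).hexdigest()
    store[key] = payload
    return key


store = {}
v1 = ("body", [("nav", ["menu"]), ("p", ["hello"])])
v2 = ("body", [("nav", ["menu"]), ("p", ["goodbye"])])
root1 = put_tree(v1, store)
count_after_v1 = len(store)
root2 = put_tree(v2, store)
count_after_v2 = len(store)
```

In this example the second snapshot only adds the changed text node, its parent, and a new root; the unchanged `nav` subtree is shared between both versions without running a separate diff step.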
@jeiros I think you're pretty much correct in what you say. Maybe a few points for thought:
Depending on your workflow, it may be acceptable to retrieve only some of the dataset, e.g. for a given piece of analysis you only need to retrieve a subset of files from a given simulation. This is much easier with IPFS than some traditional data repositories. Also, if you add those files to your IPFS node, you automatically make it easier/faster for European colleagues to get those particular files. It sounds as if your data is a single binary file, though?
The intention is for there to be different importers for IPFS to optimize chunking of specific content types, e.g. video or HDF5 files. In theory this could help with de-duplication (i.e. de-duplicate content across multiple simulations) and streaming of content. I don't know how this would apply in your case. https://github.com/ipfs/specs/tree/master/dex
Presumably the original data archive is duplicating or triplicating the data, i.e. through backup etc. A collaborative/co-operative approach to data archiving could meet these backup requirements while improving access requirements. The tricky thing here is governance, but there is good precedent with things like LOCKSS (https://www.lockss.org/). One could imagine a bi-lateral undertaking between the European Open Science Cloud and the US equivalent, or between collaborating centres in a given domain of science. So, you'd have sponsored/trusted nodes pinning the content much as data repositories do now, which is then supplemented by ephemeral nodes who temporarily pin content or pin content that is of interest to them (e.g. a research group pins a dataset it uses regularly; an institution pins content produced by its researchers, etc.). IPFS itself won't help with the governance issues, but Filecoin might help incentivise third-party replication. Ultimately, though, archiving of scientific data is a public good, and different economics apply: https://www.biorxiv.org/content/biorxiv/early/2017/03/14/116756.full.pdf
We developed our Secure Peer Assist (SPA) technology to overcome the issues with distributing video (movie) and other large files via the Internet. It is now approved by one of the biggest studios in Hollywood with more to come. We identified, very early on, the need for a file system. We were aware of some of the work around content addressing and new models for the Internet (although not specifically IPFS) but were very much aware of our limitations as a startup and felt these were outside our remit. So, we specified our own version and included its development in our budgets. It seems that IPFS has come along at the perfect time to meet that requirement. While it seems to be early days in its development, that is a good thing in that it enables us to contribute and influence its direction. We're optimistic that recent developments in Filecoin and crypto-currencies in general will also help accelerate that significantly.
Why we like IPFS
It fits our architecture, philosophy and values perfectly (assuming positive answers to our high level questions in the IPFS discussion forum thread "GT Systems IPFS and Filecoin questions")
It supports hashes and therefore content addressing and DHT
It scales BIG, hopefully to Exabytes and beyond
It becomes more efficient as it scales
It has no single point of failure and shards can continue to function
It ISN'T BitTorrent, which makes it more acceptable to the studios (but, again, see our questions around security)
It encrypts files at rest. Currently, we use PlayReady 3 to do that because it is acceptable to the studios. Hopefully, as we continue to work with the studios and introduce IPFS, we may be able to use the native IPFS encryption. That will depend on how secure it is and will require an extension of the journey we have been on for 10 years with the studios. But, right now, our architecture (including PR3) is approved. If we can make it work with IPFS, we are good to go with one of the best catalogues in the world, with more to come.
Combined with Filecoin and our technology, IPFS provides the perfect mechanism for our customers to share movies. It fits our business plan and business model perfectly. Using our relationships, it overcomes ALL the issues (tech and business) of distributing movies via the Internet. This is based on a VERY deep understanding of the real tech and business issues and motivators, gained from working with all the Hollywood studios and Indies on digital distribution for 10 years.
Given certain assumptions, we think we may be able to significantly reduce the cost of movies to consumers, while keeping rights owners (studios and indies) happy.
It will allow us to come to market MUCH quicker and requires MUCH less funding.
Between us, we can change the way movies and TV are distributed and sold for the foreseeable future and help fix the Internet. We like that very much.
I'm working on a side project for creating a database for learning materials, including large media files. Such a database could be pretty big, perhaps not dozens of terabytes but still sizeable. One thing I want for this database is for it to be decentralized, where many people can pitch in to host it, and versioned using a graph of trust (like the Linux kernel), rather than allow-edits-then-fix like Wikipedia. For that I'm developing a DVCS on top of IPFS.