For an experiment I am running to get insight into how data is made available within IPFS, I have a private network consisting of about 10 nodes. A file is published on one of the nodes and then retrieved from the other nodes.
From inspecting the bitswap ledger, I notice that close to 90% of the data is still being provided by the node that published it. I was not expecting the bulk of the data to come from one node.
Is there an explanation for this? I assume this is a function of the DHT, right? And is it possible to influence the data distribution so that data is fetched evenly from the other nodes?
The most important thing to note is that IPFS is not free permanent data storage. At a high level, the DHT has a bunch of peers that have volunteered to provide some service to the network. Keeping the resource utilization of this volunteer service low is important, as it incentivizes higher usage. If you could just ask random people on the internet to store 1TB of data for you, that would likely lead to serious problems both in adoption (I don’t have the disk space, bandwidth, or motivation to store TBs of data for random people online) and legality (I’m not a lawyer and cannot provide legal advice, but if I were to store/provide illegal content just because a random person online asked me to, that seems like it would be bad news).
So what does the DHT do? For immutable IPFS data, the DHT handles provider records. These are effectively advertisements where you tell the network “I have file F”. This makes it possible for you to then ask IPFS “get me F”, and it will ask the DHT where to find F and then use bitswap to actually get the data from the peers that have advertised having it. This is as opposed to the classic web, where if data used to be stored at dnsdomain1.com/myfile and then went offline, I have to use a search engine to look for “myfile” and see if there’s anywhere else where it might be. Since the data is content addressed and there’s a shared “advertising” space, we can find the data no matter where it might be.
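To make the idea of provider records concrete, here is a toy model in Python. It is only a sketch: `ToyProviderIndex`, the peer names, and the string `"cid-F"` are made up for illustration, and a plain in-memory dictionary stands in for the distributed table. It is not how any IPFS implementation actually stores records.

```python
from collections import defaultdict


class ToyProviderIndex:
    """Hypothetical stand-in for the DHT's provider records (not a real IPFS API)."""

    def __init__(self):
        # Maps a content identifier to the set of peers advertising it.
        self._records = defaultdict(set)

    def provide(self, cid, peer_id):
        """Record the advertisement 'peer_id has cid'."""
        self._records[cid].add(peer_id)

    def find_providers(self, cid):
        """Return the peers that have advertised cid, in sorted order."""
        return sorted(self._records.get(cid, set()))


index = ToyProviderIndex()
index.provide("cid-F", "publisher")
# A downloader that finishes fetching F advertises it too.
index.provide("cid-F", "node-2")
providers = index.find_providers("cid-F")
print(providers)
```

The point of the sketch is that the lookup is keyed by the content identifier rather than by a location, so any peer that has advertised the content is a candidate source.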
Side Note: Bitswap will also just ask peers you’re connected to if they have the data, so doing a DHT search isn’t necessary if you’re already connected to someone who has the data you’re looking for.
Side Note: By default, when you download data you also advertise to the network that people can download the data from you. This means that if you waited a while (however long a reprovide cycle is in your config file), you would notice that an 11th node added to the network would download data from all 10 pre-existing nodes.
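For reference, in the go-ipfs releases I am familiar with, the reprovide cycle length lives under the `Reprovider.Interval` key of the node's JSON config, and `ipfs config Reprovider.Interval` prints it. Treat that as an assumption to verify against the config documentation for your version, since the key may be named differently in other releases. A small Python sketch of reading it from an already-parsed config follows; the `sample` fragment and the helper name `reprovide_interval` are hypothetical.

```python
import json


def reprovide_interval(config, default=None):
    """Return Reprovider.Interval from a parsed config dict, or default if absent."""
    return config.get("Reprovider", {}).get("Interval", default)


# Hypothetical config fragment; on a real node this would be loaded from the
# JSON config file in the IPFS repo directory. The key name is an assumption
# that depends on the installed version.
sample = json.loads('{"Reprovider": {"Interval": "12h", "Strategy": "all"}}')
print(reprovide_interval(sample))
```

If the key is missing, the helper returns the supplied default instead of raising, which is convenient when comparing nodes running different versions.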
This is the crucial part. In my setup I have nodes downloading the same data from the node that first published it. My expectation was that after a while the data would be available on other nodes, and further retrievals would no longer come only from the original node that published the data. I do observe this; I just expected more of the content to come from other nodes.
My observation was that over 90% of the data was still coming from the originating node, while the remaining pieces came from nodes that had already downloaded the content. My main question then is: is there a way to ensure that more of the content comes from other nodes that have previously downloaded the data?
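To be clear about how a figure like this can be derived, the calculation is just each peer's share of the total bytes received. A minimal Python sketch follows; the byte counts and peer names are hypothetical placeholders, not my measurements, and in practice the per-peer numbers would be read off each peer's bitswap ledger.

```python
def share_by_peer(bytes_received):
    """Return each peer's percentage of the total bytes received."""
    total = sum(bytes_received.values())
    if total == 0:
        # Nothing received yet, so every share is zero.
        return {peer: 0.0 for peer in bytes_received}
    return {peer: 100.0 * n / total for peer, n in bytes_received.items()}


# Hypothetical byte counts per peer, for illustration only.
received = {"publisher": 900, "node-2": 60, "node-3": 40}
shares = share_by_peer(received)
for peer, pct in shares.items():
    print(f"{peer}: {pct:.1f}%")
```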
This sounds interesting. I’ll go find what the default configuration for this is, and observe whether changing it affects how soon other nodes start providing the data.
I quickly checked the documentation here, and I also see info on how to trigger a reprovide. Do you know where the setting for the length of a reprovide cycle is found?