I am new to IPFS. I initially understood that if I have an IPFS node and write a file to it, the file will survive even if my node disappears (a bit like a distributed RAID 5 system): “every file on IPFS can be hosted in many different places, yet accessed from the same address. If one computer hosting the file goes offline, the network will just retrieve the file from another computer.”
But while reading around, it seems this is not quite the case; my files are not duplicated: “IPFS removes duplications across the network”. I also read that if nobody requests my files, they stay only on my node (meaning my files can disappear if my node goes down). And if another user requests a file and it gets copied to another node, that seems to contradict the claim that duplicates are removed.
I am trying to store publicly available files that should stay publicly available, even if my node disappears.
I think there is something I am missing here, can someone please explain it to me?
IPFS does NOT copy stored data anywhere when it is added to your node. Whenever the file is initially viewed, the IPFS DHT is used to locate the content and it will be retrieved at that time from your node. Obviously, if your node is not running, the file(s) will not be available.
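As a rough sketch of what that lookup looks like (the CID and peer ID below are placeholders, and the output is trimmed), after you add a file, only your own node advertises it in the DHT:

    $ ipfs add my-document.pdf
    added QmYourFileCid my-document.pdf

    # from any node: ask the DHT who can provide that CID
    $ ipfs dht findprovs QmYourFileCid
    12D3KooWExamplePeerID    # only your node's peer ID, until someone else fetches or pins it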
Once a file has been accessed elsewhere, any node involved in that distribution MAY cache the pieces it handled. Those pieces only remain until they are garbage collected, which happens when the node runs low on space (or when garbage collection is run manually).
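You can see that cache-then-collect behaviour on your own node. A minimal sketch, assuming the CID below stands in for something you fetched but never pinned:

    $ ipfs cat QmSomeRemoteCid > /dev/null   # fetching caches the blocks locally
    $ ipfs repo gc                           # unpinned blocks are removed again
    removed QmSomeRemoteCid

(The daemon also runs garbage collection on its own when started with --enable-gc and the datastore approaches its size limit.)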
If any node in the network chooses (or is paid) to PIN your content (hopefully it is pinned on your own node as well), then that node will retrieve the content and keep it until it is unpinned. The IPFS DHT provides information about all sources of the chunks, so whenever the file is subsequently accessed, chunks may come from anywhere, not necessarily from your node.
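For example, anyone who wants to help keep your content alive can pin it on their own node (QmYourContentCid is a placeholder for your actual hash):

    $ ipfs pin add QmYourContentCid      # fetches the whole DAG and protects it from garbage collection
    pinned QmYourContentCid recursively

    $ ipfs pin ls --type=recursive
    QmYourContentCid recursive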
The “duplicates” that are removed relate to identical content between chunks of various files. Because every chunk is content-addressed, identical content always produces the same CID, and a node stores a chunk with a given CID only once, since the CID is the unique ID of that chunk. In fact, if a given file has already been uploaded by 2 or more different people, its chunks will already exist somewhere. That won’t prevent your node from storing and pinning your upload, but when it is fetched, it might come from anywhere.
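You can check the “same content, same CID” part yourself; the hash shown is a placeholder, but the point is that both commands print the identical one:

    $ cp cat.jpg cat-copy.jpg
    $ ipfs add -q cat.jpg
    QmSameHashBothTimes
    $ ipfs add -q cat-copy.jpg
    QmSameHashBothTimes      # identical bytes -> identical CID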
So, if you want content to be available, you need to either keep your node up or have an always-available node pin your content, potentially for compensation.
Now, if any of this is wrong, please let me know so I can stop spreading misinformation, but this is the way I understand IPFS to work.
Thanks for asking that here, Gabriel. Your understanding is not uncommon, but as ldeffenb says, IPFS on its own does not address persistent file storage; you have to use a pinning service or some other solution for that. May I ask where you got your understanding from and where you are quoting from? It would be helpful to correct that, if we can.
Thank you for your explanation, I see now. So unless lots of people PIN my content on other nodes, there is no guarantee that the information will stay publicly viewable if my node goes down.
Sadly, IPFS did a very poor job of explaining the statement “IPFS removes duplications across the network”, and others have repeated it. In a very strict sense, it is true.
But it is worded so vaguely that, if you are just getting to know IPFS, you will think any given file exists only once on the entire IPFS network. I had that very same question when I started to learn more about IPFS.
How it works is that each and every file is chunked (cut into small pieces) and hashed. Eventually you get one hash for your file; that’s the Qm... hash you see in lots of places.

So imagine you add a file of a popular cat image. You type ipfs add <your cat image>, which gives you a Qm... hash. Now that you have added it, it is pinned by default, so your node can serve that very file to whoever requests it by its hash.

The “deduplication” is in the form of those Qm... hashes. If I add the very same cat file, I also get a Qm... hash, the exact same hash as you have. That happens because we both (by default) use the same chunking and hashing algorithm. So the deduplication here means that we both get the same hash when we add the same file. The file is still added by you (and therefore pinned) and by me (thus also pinned), so now there are 2 nodes on the IPFS network with that same file that can serve it to whoever requests it. See the sketch below.
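A sketch of that flow with the CLI, assuming default settings (the hashes are placeholders; the real output will be valid CIDs):

    $ ipfs add cat.jpg
    added QmCatImageCid cat.jpg    # now pinned on your node by default (see: ipfs pin ls)

    # for files bigger than the default chunk size (~256 KiB),
    # the root hash just links to the chunk hashes
    $ ipfs refs QmCatImageCid
    QmChunk1Cid
    QmChunk2Cid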
And yes, if you modify just a single pixel in the image, you will get a different hash.
You can add files and have them be accessible when your node is offline. In that case you need to either use a pinning service or get other people to pin your file. Just a note here: everyone who “fetches” your file automatically becomes a “hoster” (or provider) of that file, at least temporarily. It’s not exactly like pinning, but you can compare it to that. So in other words, if you have a popular website that lots of people visit, it ends up being hosted by all your visitors.
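If you go the pinning-service route, newer versions of the IPFS CLI can talk to such services directly. The service name, endpoint, and API key below are placeholders for whatever provider you sign up with:

    $ ipfs pin remote service add mypinner https://pinning.example.com/psa YOUR_API_KEY
    $ ipfs pin remote add --service=mypinner --name=my-website QmYourContentCid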