Community-driven website archive on IPFS

Summary: given a web resource with an enormous amount of data, how would you organize community-driven archiving of its contents to IPFS?

A wonderful short-video hosting site, coub.com, is closing soon. Many users and content creators are grieving over this event. I am thinking about archiving its content in IPFS.

Writing a standalone program that downloads video files and uploads them to IPFS seems like a doable job to me. The problem is that there is an enormous amount of content there, so the archiving should be performed not by a single person but by a collective of enthusiasts, each running the archiving software. All those processes must somehow collaborate in order to incrementally grow a single public tree of data. By collaboration I mean that:

  1. A newly joined process must somehow obtain the list of already downloaded videos (in order to not perform redundant work)
  2. After a process downloads a video, that video must somehow get included in the common tree.

I have little idea about how to implement such a network of processes.

How would you design it?
Are there any sources of relevant information (existing projects, articles, etc.) that might be useful for the job?

Notes:

  • In practice, we would have to deal somehow with corrupted data (whether the result of an accidental error or of someone’s bad intentions). For the sake of keeping the topic as simple as possible, let’s suppose there is no such problem
  • The desired end result is to create a nice user interface (a website like en.wikipedia-on-ipfs.org/wiki or a mobile app), but let’s focus on just moving the data into IPFS in the simplest form possible

Given the current situation, where IPFS tooling is far from complete, I would create a GitHub repository and coordinate the tasks through it. Tasks are distributed there, CIDs are gathered and added to an IPFS folder by the moderators, and the result is published at the end.

Remember that the same video file will give you the same CID (with default settings), so in theory you don’t have to split the work. With a common tool (video → IPFS → CID), coordination would not be needed.
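A minimal sketch of what such a common tool could look like, assuming the `ipfs` CLI (kubo) is installed and a daemon is running; the download URL and file name are placeholders, not coub-specific:

```python
#!/usr/bin/env python3
"""Sketch of a shared "video -> IPFS -> CID" tool.

Assumptions: the `ipfs` binary (kubo) is on PATH with a running daemon,
and the caller supplies the video URL and a local file name.
"""
import subprocess
import sys
import urllib.request

def archive(video_url: str, filename: str) -> str:
    # Download the video file locally first.
    urllib.request.urlretrieve(video_url, filename)
    # Add it with default settings: identical bytes produce identical CIDs,
    # so any participant adding the same file gets the same result.
    result = subprocess.run(
        ["ipfs", "add", "-Q", filename],  # -Q prints only the final CID
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()

if __name__ == "__main__":
    # Hypothetical usage: archive.py <video-url> <output-file>
    print(archive(sys.argv[1], sys.argv[2]))
```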

The only coordination needed would be creating an index file for the entire catalogue.
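The index format would be up to the community; one possibility (purely an assumption, not an agreed format) is a line-per-video JSON file kept in the coordination repository and merged via pull requests:

```python
# Hypothetical catalogue format: one JSON object per line (catalogue.jsonl),
# mapping a video's identifier and title to the CID produced by the common tool.
import json

def index_entry(video_id: str, title: str, cid: str) -> str:
    return json.dumps({"id": video_id, "title": title, "cid": cid}, sort_keys=True)

# Example line appended to catalogue.jsonl:
# {"cid": "bafy...", "id": "abc123", "title": "some coub"}
```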

edit: with this scheme, redundancy would happen naturally for popular content, and videos that no one cares about would not be archived.

Different nodes might end up downloading the same video again and again.

Yes! And that’s a good thing.

My approach would be to let everyone add their favourite videos to IPFS with a tool guaranteeing that the same video yields the same CID.

This way the most popular videos would be the most available on IPFS.


Then you should ask users to pin it, though according to tradition IPFS doesn’t expect users to do the heavy lifting :sweat_smile:, which I don’t quite agree with.
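For completeness, pinning a video that someone else has already archived is a single call; a minimal sketch, assuming a local kubo daemon and a CID taken from the catalogue (the CID shown is a placeholder):

```python
# Sketch: pin an already-archived video locally so this node keeps a copy.
import subprocess

subprocess.run(["ipfs", "pin", "add", "bafy..."], check=True)  # placeholder CID
```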

Maybe use a hash of the video metadata to create a unique index? This way it is even less likely to end up with duplicates.

For the metadata I would use IPLD, then link all of it together to form some kind of indexing system.
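As an illustration only (the metadata fields and key scheme are assumptions, not an agreed format), storing each video’s metadata as an IPLD node that links to the video’s CID could look like this with kubo’s `ipfs dag put`, which reads dag-json from stdin and prints the CID of the stored node:

```python
"""Sketch: one IPLD node per video, keyed by a hash of its metadata.

Assumptions: a local kubo daemon, and metadata already fetched as a dict.
"""
import hashlib
import json
import subprocess

def store_metadata(video_cid: str, metadata: dict) -> tuple[str, str]:
    # Deterministic index key: hash of the canonicalised metadata, so two
    # participants indexing the same video independently derive the same key.
    canonical = json.dumps(metadata, sort_keys=True, separators=(",", ":"))
    key = hashlib.sha256(canonical.encode()).hexdigest()

    # IPLD node in dag-json form; {"/": <cid>} is how dag-json encodes a link.
    node = dict(metadata, video={"/": video_cid})
    result = subprocess.run(
        ["ipfs", "dag", "put"],
        input=json.dumps(node),
        check=True, capture_output=True, text=True,
    )
    return key, result.stdout.strip()

# key, meta_cid = store_metadata("bafy...", {"id": "abc123", "title": "some coub"})
```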


I think what you want is a collaborative cluster:
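Joining a collaborative cluster as a follower is essentially one command; a minimal sketch, where the cluster name and configuration template URL are hypothetical placeholders that the coordinators would announce:

```python
# Sketch: follow a (hypothetical) collaborative cluster and start pinning its
# content. Requires the ipfs-cluster-follow binary and a local ipfs daemon.
import subprocess

subprocess.run(
    ["ipfs-cluster-follow", "coub-archive", "run",
     "--init", "https://example.org/coub-archive.json"],  # placeholder name and URL
    check=True,
)
```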