How to automatically copy web pages to IPFS while surfing

I am located in a place with varying connection QoS (4G in a forest).
Many network timeouts during the day, decent bandwidth at night.

The idea would be to “memorize” web locations that time out, through a proxy.
I sent this message to the Privoxy mailing list:


I would like to add a (post-)processing step on the proxy cache, extracting all
http:// links to add the resources to IPFS, then replacing each URL with the
corresponding IPFS gateway one.

Any file copied into IPFS is accessible through its hash.
To “migrate” a web site, every link to a static resource in a web page is
rewritten.

Here is an example for processing http://www.eautarcie.org/.
This line of HTML code is rewritten as shown below:

<div class="last"><a href="de/index.html" class="cta" title="KLICKEN SIE HIER">Klicken Sie Hier</a></div>

IPFS transformation (the -w flag wraps the file in a directory, so the file stays addressable by name under the directory hash):


wget http://www.eautarcie.org/de/index.html

ipfs add -w index.html

added QmSVS5DjuVw2VCvHTQGX5fvLfjt6yEugtKY8gKP2XAhisg index.html
added QmVpazeFPeB66TrDwUCz1sJ9Q7oxPAu7zqAjUDTrvz353P

HTML code modification:

<div class="last"><a href="/ipfs/QmVpazeFPeB66TrDwUCz1sJ9Q7oxPAu7zqAjUDTrvz353P/index.html" class="cta" title="KLICKEN SIE HIER">Klicken Sie Hier</a></div>

I could use httrack, wget, grep, and awk to do it.
Then I wonder if this processing could be triggered from Privoxy rules and
actions?


This seems possible.

Privoxy has experimental support for external filters which can
be written in any programming language your system supports:
https://www.privoxy.org/user-manual/actions-file.html#EXTERNAL-FILTER
https://www.privoxy.org/user-manual/filter-file.html#EXTERNAL-FILTER-SYNTAX

If external filters work on your operating system you should
be able to use them to achieve your goal.

You can check http://p.p/show-status to see if FEATURE_EXTERNAL_FILTERS
has been compiled in.
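
Putting this together, the wiring might look like this (an untested sketch; I’m assuming, per the manual pages above, that an external filter receives the page body on stdin and writes the rewritten body to stdout):

In the filter file (e.g. user.filter):

EXTERNAL-FILTER: page2ipfs Mirror static resources to IPFS and rewrite links
/usr/local/bin/page2ipfs.sh

In the action file (e.g. user.action):

{+external-filter{page2ipfs}}
.eautarcie.org

And a skeletal /usr/local/bin/page2ipfs.sh:

#!/bin/bash
# Hypothetical filter script: mirror each http:// resource into the
# local IPFS node and point its link at the local gateway copy.
html=$(cat)         # page body piped in by Privoxy
tmp=$(mktemp -d)

# Collect absolute http:// links to likely-static resources.
for url in $(printf '%s' "$html" \
        | grep -Eo 'http://[^"'\'' <>]+\.(html|css|js|png|jpe?g|gif|pdf)' \
        | sort -u); do
    file="$tmp/$(basename "$url")"
    wget -q -O "$file" "$url" || continue      # skip resources that time out
    hash=$(ipfs add -Q "$file") || continue    # -Q prints only the hash
    html=${html//"$url"/"http://127.0.0.1:8080/ipfs/$hash"}
done

rm -rf "$tmp"
printf '%s' "$html"    # rewritten page back to Privoxy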


I wonder if anyone has already made a proxy2ipfs solution?

If anyone is interested, I’ll continue to report…


proxy2ipfs

That’s a neat idea. I like it…

IPFS Companion has the beginnings of your proxy2ipfs. If you’re running a local node and the IPFS Companion browser extension, you can import any webpage into your local node by selecting that option from the alt-button click menu in your browser.

However, IPFS Companion neither downloads the supporting links in the page nor changes the HTML to reflect local IPFS storage of the page materials. Some browsers have a reader mode or a print-page-to-file feature which could probably be repurposed for obtaining the images and other materials linked in the page. Then it’s a matter of storing the original link and the new local IPFS link in a table, probably SQLite.
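
For that table, a minimal sketch using the sqlite3 CLI (the database and column names are hypothetical):

sqlite3 proxy2ipfs.db 'CREATE TABLE IF NOT EXISTS mirror (
    original_url TEXT PRIMARY KEY,  -- the upstream http:// link
    ipfs_path    TEXT NOT NULL      -- e.g. /ipfs/<hash>/index.html
);'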

Keep in mind that you don’t have a license to redistribute random webpages due to copyright laws. They’ll contain different copyrighted materials, and you’d be making copies available to others through IPFS without the express permission of the intellectual property rights owners.

It’s quite possible to configure an IPFS node to not have any connections beyond localhost. In fact, I have several nodes running as single nodes without any connections to any other nodes. IPFS is a convenient storage method (especially with the newest de-duplication in v0.12) … and if one opens the gateway to the LAN, it’s easy to grab items from the storage using just a browser.
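
One way to set up such an isolated node (a sketch; the gateway address is an example):

# Detach the node from the public swarm by dropping all bootstrap peers
ipfs bootstrap rm --all

# Optionally disable local peer discovery as well
ipfs config --json Discovery.MDNS.Enabled false

# Expose the gateway to the LAN instead of localhost only
ipfs config Addresses.Gateway /ip4/0.0.0.0/tcp/8080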

From OP’s description, proxy2ipfs would function as local-only storage of material and not a WAN-accessible distribution point of material, copyrighted or otherwise.

That would mean that every caching proxy is in violation of copyright law.

I am glad you like the idea.

de-duplication & increasing availability are really neat features.

Our community is acting as a solarpunk civilisation demonstration.
We are collecting and using educational and informative materials (web sites, video, sound) in courses, workshops, and various artistic mashups. Our network is mainly off-grid and made of unconnected LANs.

Everyone brings their own material or needs access to it, whether connected (this is where proxy2ipfs would be useful) or not. For now we are experimenting with a “friend of a friend” IPFS storage, where every station publishes under its own IPNS name a web page containing its own published resources plus those of its friends (it keeps a copy of each friend’s resources in case they go offline). This keeps this chaotic storage area almost always available.
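
A minimal sketch of that publish-and-mirror step (the friend’s peer ID and the ~/published directory are placeholders):

# Keep a local copy of a friend's published page in case they go offline
ipfs pin add "$(ipfs resolve -r /ipns/<friend-peer-id>)"

# Publish our own index page (own resources plus friends' links) under our IPNS name
ipfs name publish "/ipfs/$(ipfs add -Q -r ~/published)"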

@da2w, about the copyright concern: I think the whole current web is illegal then, because it is full of caching and proxying methods used to keep performance and speed. Even Google feeds its search engine with billions of copyrighted items…


No, they’re not. US and other national copyright laws contain an exception for caching. However, IPFS doesn’t meet the requirements for that exception: specifically, it lacks a mechanism for the content owner to set an expiration date, request removal, or exempt their content from being cached.

You seem to be correct. IANAA (I am not an attorney) and haven’t read through the entire thing yet, but it is interesting: 17 U.S. Code § 512, Limitations on liability relating to material online (Legal Information Institute).

In my opinion, central control over p2p storage is impossible to achieve.
And if it were established by law, it would be a great brake on spontaneous creativity.

It seems to me that NFTs are discreetly preparing this new “hash-linked crypto web space”…

It is best that data manages its rights on its own.
In our prototype we publish either a clear IPFS address or an IPNS name (pointing to HTML) that forces a step before access. There anyone can put the most suitable “contract” to execute: a kind of meta-protocol for how data is written and published over IPFS.

In this mediacenter page example, one is redirected after being encouraged to tip; the page could instead wait for a payment or verify user rights.
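
A minimal sketch of such a gate page (the target hash and the refresh delay are placeholders):

# Build a hypothetical gate page that forwards visitors to the real content
printf '%s\n' \
    '<meta http-equiv="refresh" content="10; url=/ipfs/<target-hash>/index.html">' \
    '<p>Please consider tipping the author; you will be redirected shortly…</p>' \
    > gate.html

# Publish it under our IPNS name so the visible address stays stable
ipfs name publish "/ipfs/$(ipfs add -Q gate.html)"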

So, technically, it is really not a problem to maintain creators’ rights. It becomes one when a central entity is trying to enforce them.

Publishers have tried to discourage copying materials ever since the printing press was invented.

A single human with a flatbed scanner is a modern publisher’s nightmare. Almost all copyright laws are in place to protect the profit margins of the publisher rather than the author. In the US, Congress somehow enacted a RETROACTIVE extension of copyright… this is crazy, since the whole premise of copyright is a limited exclusive right to publish in exchange for publishing the creative work. In essence, if a given copyright length was acceptable to the author at the time of creation of the work, then the copyright law at the time was an acceptable and fair agreement to the author.

So, the government stole from the Public Domain and gave the profits to large Publishers.

After stating the ideological and practical limitations, I strongly encourage everyone to respect Big Publishing Houses and their lobbyists in governments worldwide… and not put anything protected by copyright into public IPFS nodes.