How to modify small sections of a large file stored in IPFS?

AFAIK, it is impossible to modify a file stored in IPFS in place.
Technically, we can re-upload the file to IPFS to simulate the “modify” operation.

But the problem arises whenever we want to modify a small section of a large file, say 10 GB. We have to re-upload the whole thing and waste a lot of computation on duplicate blocks just for a few newly added ones. Although data deduplication saves huge amounts of space on duplicate blocks, we still suffer from the time wasted calculating the hashes of those duplicate blocks.
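To illustrate the cost, here is a toy sketch (not how go-ipfs is actually implemented; the block store, chunk helper, and `add` function are all made up): a content-addressed store deduplicates storage for free, but a re-upload still hashes every chunk.

```python
import hashlib

CHUNK_SIZE = 256 * 1024  # go-ipfs default chunk size

def chunk(data, size=CHUNK_SIZE):
    return [data[i:i + size] for i in range(0, len(data), size)]

# Toy content-addressed block store: duplicate blocks cost no extra space,
# but every chunk still has to be hashed on each "re-upload".
store = {}

def add(data):
    hashed = 0
    for block in chunk(data):
        digest = hashlib.sha256(block).hexdigest()
        hashed += 1
        if digest not in store:   # deduplication: store only new blocks
            store[digest] = block
    return hashed

original = b"".join(bytes([i]) * CHUNK_SIZE for i in range(4))  # 4-chunk "file"
add(original)                                    # stores 4 blocks
modified = original[:CHUNK_SIZE] + b"x" + original[CHUNK_SIZE + 1:]
rehashed = add(modified)                         # stores 1 new block, hashes all 4
```

Only one new block lands in the store, but all four chunks were hashed again, which is exactly the wasted work described above.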

And there is an awesome application called Peergos.

As Peergos describes on their official page (

Peergos can handle arbitrarily large files efficiently. Our maximum file size is far bigger than any other storage provider we are aware of (assuming you have enough space on the server). We can stream large files like videos and start playing immediately, or quickly skip through to a later part. Despite being end-to-end encrypted, we can efficiently modify small sections of large files.

They claimed that Peergos can modify small sections of large files efficiently.

Does anyone have any idea how they implement it, or any good idea for solving the problem described above?

It seems that I found the document which tells how they handle this.

Below is the link to the page describing how Peergos modifies small sections of a large file.

IPFS chunks your files into blocks and builds a Merkle-DAG to content-address them (

I don’t know what Peergos does, but “editing” a file efficiently would involve keeping track of which blocks you modified, and adjusting all the DAG nodes/branches affected up to the root.
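A minimal sketch of that idea (assuming a plain binary Merkle tree with a power-of-two number of leaves; the real IPFS DAG layout differs): editing one chunk only requires rehashing the leaf plus the O(log n) inner nodes on its path to the root, while every sibling subtree is reused untouched.

```python
import hashlib

def sha(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build(chunks):
    """Build a binary Merkle tree; returns the levels, leaves first.
    Assumption: len(chunks) is a power of two, so every node has a sibling."""
    level = [sha(c) for c in chunks]
    levels = [level]
    while len(level) > 1:
        level = [sha(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def update(levels, index, new_chunk):
    """Replace one leaf and rehash only the path up to the root."""
    rehashed = 1
    levels[0][index] = sha(new_chunk)
    for d in range(1, len(levels)):
        index //= 2
        left = levels[d - 1][2 * index]
        right = levels[d - 1][2 * index + 1]
        levels[d][index] = sha(left + right)
        rehashed += 1
    return rehashed

chunks = [bytes([i]) * 1024 for i in range(8)]   # an 8-chunk "file"
levels = build(chunks)
old_root = levels[-1][0]
cost = update(levels, 5, b"edited chunk")        # 1 leaf + 3 inner nodes
new_root = levels[-1][0]
```

For the 8-chunk file the edit costs 4 hashes instead of 8, and the gap widens logarithmically as the file grows.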


Hi hector,

Thanks for your fast response.

If I understand correctly, it is not yet possible to do such a modification in go-ipfs, right?
At the current stage we can only re-upload the file, unless we implement it ourselves with the provided low-level APIs. Which means that we still have to walk through the whole file and calculate its final hash value.

Well, you can mount IPFS-MFS as a filesystem and you can modify files there as you wish. Other than that, using the HTTP APIs, it would be moderately painful. Writing a program in Go to write some contents given a root hash and an offset might be the saner way of doing things.

Peergos says they’re chunking in 5 MB blocks. The challenge with any fixed-size chunking algorithm is that if you insert (for example) one single byte at the very front of the file, then the hash of every chunk changes (all of them are shifted by a byte). That destroys the ability to reuse chunks and has the same effect as duplicating the storage of the entire file (because you get all new chunks).
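The shift effect is easy to demonstrate with a toy where 5-byte chunks stand in for 5 MB ones: after prepending a single byte, not one chunk hash survives.

```python
import hashlib

def fixed_chunks(data, size=5):
    """Cut at fixed offsets, like a default (non-rolling) chunker."""
    return [data[i:i + size] for i in range(0, len(data), size)]

data = b"abcdefghijklmnopqrstuvwxy"           # 25 bytes -> five 5-byte chunks
before = {hashlib.sha256(c).hexdigest() for c in fixed_chunks(data)}
after = {hashlib.sha256(c).hexdigest() for c in fixed_chunks(b"!" + data)}
shared = before & after                        # every boundary shifted by one
```

`shared` comes out empty: the one-byte insertion turned a 25-byte file into 26 bytes of entirely "new" blocks.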

I think I read somewhere that rsync has an intelligent way of choosing its chunks: instead of simply cutting at every fixed chunk-size boundary, it uses the data itself to choose block boundaries, precisely because of the known need to reuse blocks.
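For reference, the mechanism that makes this affordable in rsync is a weak rolling checksum that can be slid one byte at a time in O(1) instead of rehashing the whole window. A sketch of the classic (a, b) checksum pair (simplified; the function names are mine):

```python
def weak_sum(block, mod=65521):
    """rsync-style weak checksum: returns the (a, b) pair for a block."""
    a = b = 0
    for byte in block:
        a = (a + byte) % mod
        b = (b + a) % mod
    return a, b

def roll(a, b, out_byte, in_byte, n, mod=65521):
    """Slide an n-byte window one byte to the right in O(1)."""
    a = (a - out_byte + in_byte) % mod
    b = (b - n * out_byte + a) % mod
    return a, b

data = b"the quick brown fox jumps over the lazy dog"
n = 8
a, b = weak_sum(data[0:n])
rolled = []
for i in range(1, len(data) - n + 1):
    a, b = roll(a, b, data[i - 1], data[i + n - 1], n)
    rolled.append((a, b))

# Recomputing each window from scratch gives the same checksums.
direct = [weak_sum(data[i:i + n]) for i in range(1, len(data) - n + 1)]
```

Because the checksum can be evaluated at every byte offset cheaply, the receiver can find matching blocks anywhere in the file, not just at fixed boundaries.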

So the key question is: can IPFS do this kind of intelligent chunk selection? Because if so, it could theoretically be efficient at modifying large files, and could also accomplish the equivalent of rsync, where only small data transfers are required to sync large files/folders.

IPFS can use Rabin and buzhash chunkers (the --chunker option in ipfs add). These are meant to do exactly that: magically find the right block boundaries to increase deduplication when possible. Of course, depending on the input, they will work better or worse. When using them, blocks will not be of a fixed size as they are by default.
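The principle behind those chunkers can be shown with a deliberately degenerate content-defined chunker that cuts after every newline byte. Real Rabin/buzhash chunkers cut where a rolling hash over a sliding window matches a pattern, but the key property is the same: boundaries come from the data, not from fixed offsets, so they resynchronize after an insertion.

```python
def cdc_chunks(data):
    """Degenerate content-defined chunker: boundary after every b'\n'."""
    chunks, start = [], 0
    for i, byte in enumerate(data):
        if byte == 0x0A:              # cut wherever the data says to cut
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

data = b"alpha\nbravo\ncharlie\ndelta\n"
before = set(cdc_chunks(data))
after = set(cdc_chunks(b"inserted\n" + data))
shared = before & after               # all four original chunks are reused
```

Unlike the fixed-size case, the insertion produces exactly one new chunk and every original chunk deduplicates. With `ipfs add --chunker=rabin` or `--chunker=buzhash` the boundary rule is a rolling-hash condition instead of a newline, but the dedup effect after an edit is the same.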

That’s awesome news to hear, Hector! Thanks for clarifying. I was recently researching how MFS could be made to ‘simulate’ or ‘accomplish’ something approaching the efficiency of rsync, so it’s good to know this might work well.

How do I do that? I could find nothing about it anywhere, other than that it is currently unimplemented:

I was wrong: ipfs mount mounts a read-only filesystem only, so interfacing with the ipfs files commands might be the best option.

Thanks. By the way, I asked about the write-support status, and here was its author @djdv’s answer: