S3 storage without duplication

From @urnix on Wed Apr 20 2016 09:53:08 GMT+0000 (UTC)

Hi guys.

The project I'm working on is a web application deployed on AWS EC2, and it works with a large number of images hosted on AWS S3.
I would like to use IPFS in this project to deliver the images to users quickly. But there is one problem.
If I understand correctly, when I add files to IPFS, their contents are copied to ~/.ipfs/blocks, occupying as much disk space in that folder as the source files. In my case, all the files (and there are many of them) that now live on S3 would be duplicated on EC2, which does not suit me.
Question: can S3 be used as the file repository so that the contents are not duplicated?
How I see it: IPFS and my application are installed on EC2, and when IPFS users request a specific file, its contents are fetched from S3.
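To make that idea concrete, here is a minimal Go sketch of the serving side: streaming an object from S3 on demand using the AWS SDK. The bucket name and key are made up, and wiring something like this into IPFS as a custom block/file source is an assumption on my part, not an existing IPFS feature:

```go
package main

import (
	"fmt"
	"io"
	"os"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

// getFromS3 fetches an object from S3 and streams its contents to w,
// so nothing has to be kept on the EC2 instance's disk.
func getFromS3(svc *s3.S3, bucket, key string, w io.Writer) error {
	out, err := svc.GetObject(&s3.GetObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
	})
	if err != nil {
		return err
	}
	defer out.Body.Close()
	_, err = io.Copy(w, out.Body)
	return err
}

func main() {
	// Credentials and region come from the environment
	// (e.g. the EC2 instance profile).
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-east-1")}))
	svc := s3.New(sess)

	// "my-images" and "photos/cat.jpg" are placeholder names.
	if err := getFromS3(svc, "my-images", "photos/cat.jpg", os.Stdout); err != nil {
		fmt.Fprintln(os.Stderr, "fetch failed:", err)
		os.Exit(1)
	}
}
```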

I could not find an exact answer in the documentation. If somebody has had the same problem, how did you solve it?


Copied from original issue: https://github.com/ipfs/faq/issues/111


From @jwcrawley on Tue Apr 26 2016 14:58:28 GMT+0000 (UTC)

Greetings,

I mentioned this on IRC when richardlitt called this to our attention.

In my experience with EC2 and S3, I don't think S3 buckets would be a good fit for IPFS, mainly because serving a file goes from file (on S3) -> S3 API -> EC2 -> served, versus blocks (on S3) -> S3 API (one call per block) -> EC2 -> served. It amounts to turning a single file request into a whole bunch of API calls, for possibly unknown gain.

A possible (not yet implemented) way to solve this is to precompute the hashes for the files and then store the raw files in the buckets. That way, machines speaking HTTP can still easily get the files, and any new IPFS clients can (at a slightly slower speed) get the content over the IPFS network.

My understanding of how the precomputation works is as follows, and writing it out may help ferret out my bad assumptions…
the raw file is broken up into 256 KB chunks
each chunk has its hash associated with it
the root hash of a file has a list of all of its 'parts' as a list of hashes

So a 10 MB file would be broken up into 40 data chunks plus 1 list chunk. (That's why going direct to S3 would not be good: a lot more API calls for the exact same content.)
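As a rough illustration of that precompute step, here is a simplified Go sketch that splits a file into 256 KB chunks, hashes each chunk with SHA-256, and hashes the list of chunk hashes to get a root. This is only an approximation of the idea: real IPFS chunking wraps blocks in unixfs/DAG objects and multihashes, so the hashes below will not match what `ipfs add` produces.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"os"
)

const chunkSize = 256 * 1024 // 256 KB, the default IPFS chunk size

func main() {
	f, err := os.Open(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	rootList := sha256.New() // running hash over the list of chunk hashes
	buf := make([]byte, chunkSize)
	chunks := 0

	for {
		n, err := io.ReadFull(f, buf)
		if n > 0 {
			chunkHash := sha256.Sum256(buf[:n])
			rootList.Write(chunkHash[:])
			chunks++
			fmt.Printf("chunk %d: %x\n", chunks, chunkHash)
		}
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			break
		}
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
	}

	// The "root" here is just SHA-256 over the concatenated chunk hashes;
	// it stands in for the list object described above.
	fmt.Printf("%d chunks, root of hash list: %x\n", chunks, rootList.Sum(nil))
}
```

Running this on a 10 MB file should report 40 chunks, matching the count above.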