I have several users who work on large files (about 1 GB each). The files are simply large raw vectors of millions of points; each one represents the acquisition of a signal over a long period of time.
I have desktop software to visualize this data. Basically, I load the file, apply a bandpass filter to all the data, and plot the vector.
What I would like to do is visualize the data in parts in a web app. The chunks of data might not be very large, so I wouldn't have to load the whole file in the browser (I don't even know if that is possible). The files are stored on S3.
My question is then: how should I store the files so that I can retrieve them quickly in parts? For example, a file has 100 million samples in it, but I just want to plot samples [125000, 150000]. How can I manage this without having to transfer the whole file from S3 to EC2, for example? I thought of storing chunks of, say, 10000 samples, so that I would have to fetch at most 3 files, but is that a good approach?
Amazon S3 supports reading parts of an object via byte-range requests (the HTTP Range header). Provided that you can calculate the byte offset of your desired data and its length, you can read only that part.
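As a concrete illustration of the offset arithmetic, here is a minimal sketch assuming the file is a raw little-endian float32 vector with no header (adjust the sample size to your actual format):

```python
# Minimal sketch, assuming each sample is a raw little-endian float32
# (4 bytes) and the file has no header -- adjust if your format differs.
BYTES_PER_SAMPLE = 4

def sample_range_to_byte_range(first_sample, last_sample):
    """Map an inclusive sample range to an inclusive HTTP byte range."""
    start = first_sample * BYTES_PER_SAMPLE
    end = (last_sample + 1) * BYTES_PER_SAMPLE - 1
    return start, end

# Samples [125000, 150000] -> bytes 500000-600003
start, end = sample_range_to_byte_range(125_000, 150_000)
```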
This link shows how to do this with HTTP GET:
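In short, you send a `Range` header with the GET request and S3 returns only those bytes. A rough sketch with Python's standard library, assuming the object is reachable through a presigned or public URL (the URL below is a placeholder) and reusing the byte offsets computed above:

```python
# Minimal sketch of a ranged HTTP GET, assuming "url" is a presigned S3 URL
# or the object is publicly readable; start/end come from the sketch above.
import urllib.request

url = "https://my-signal-bucket.s3.amazonaws.com/recordings/signal.bin"  # placeholder
req = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
with urllib.request.urlopen(req) as resp:
    chunk = resp.read()  # S3 returns HTTP 206 with only the requested bytes
```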
And this page shows how to do this using AWS SDKs for various languages:
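For example, with boto3 (the AWS SDK for Python), `GetObject` accepts a `Range` parameter. A minimal sketch, with placeholder bucket and key names and the same float32 assumption as above:

```python
# Minimal sketch with boto3 (the AWS SDK for Python). Bucket and key names
# are placeholders; the float32 layout matches the earlier sketch.
import boto3
import numpy as np

s3 = boto3.client("s3")
resp = s3.get_object(
    Bucket="my-signal-bucket",
    Key="recordings/signal.bin",
    Range=f"bytes={start}-{end}",
)
samples = np.frombuffer(resp["Body"].read(), dtype="<f4")  # back to float32 samples
```

This way the whole file never leaves S3; you only pay for and transfer the slice you actually plot.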