I have a very large XML file on S3 (50 GB). I would like to stream this file to a SAX XML parser for further processing using Ruby. How would I do that in an environment where I cannot download the whole file locally, but can only stream it over TCP from S3?
I'm thinking about using https://github.com/ohler55/ox for the parsing itself, and https://github.com/aws/aws-sdk-ruby for accessing the file on S3. I'm just unsure how to connect the pieces using a streaming approach.
The easiest way is to use `mc`. `mc` implements a `cat` command that makes this simple. In the example below, `cat` streams your object, and its output is piped to your XML parser, which reads from standard input.
$ mc cat s3.amazonaws.com/<yourbucket>/<yourobject> | <your_xml_parser>
This way you can avoid downloading the file locally.
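On the receiving end of the pipe, the parser only needs to consume an IO object incrementally. The following is a minimal sketch of that idea, not a definitive implementation. It assumes REXML's stream parser as a stand-in because it is bundled with standard Ruby installs; the question mentions Ox, whose `Ox.sax_parse` likewise accepts an IO. The names `ElementCounter` and `count_elements` are hypothetical and exist only for illustration.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener: counts elements by name as parse events arrive,
# so the document is never held in memory as a tree.
class ElementCounter
  include REXML::StreamListener

  attr_reader :counts

  def initialize
    @counts = Hash.new(0)
  end

  # Called once for every opening (or self-closing) tag.
  def tag_start(name, _attrs)
    @counts[name] += 1
  end
end

# Parses any IO-like object (for example $stdin) event by event
# and returns a Hash of element name => number of occurrences.
def count_elements(io)
  listener = ElementCounter.new
  REXML::Document.parse_stream(io, listener)
  listener.counts
end
```

To use it behind the pipe, a script could end with `count_elements($stdin).each { |name, n| puts "#{name}\t#{n}" }` and be supplied as `<your_xml_parser>`. For a file of this size a faster SAX parser such as Ox is likely the better choice; the handler structure stays much the same.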
Additionally, `mc` provides more tools to work with Amazon S3 compatible cloud storage and filesystems. It has features like resumable uploads, a progress bar, and parallel copy. `mc` is written in Go and released under the Apache License v2. `mc` is supported on OS X, Linux and Windows.