ruby, amazon-s3, streaming, large-files, saxparser

SAX parsing a large file from S3


I have a very large XML file on S3 (50 GB). I would like to stream this file to a SAX XML parser for further processing using Ruby. How would I do that in an environment where I cannot download the whole file locally, but can only stream it over TCP from S3?

I'm thinking about using https://github.com/ohler55/ox for the parsing itself, and https://github.com/aws/aws-sdk-ruby for accessing the file on S3. I'm just unsure how to connect the pieces using a streaming approach.


Solution

  • The easiest way is to use mc, the MinIO client. mc implements a cat command, which keeps the setup simple.

    For example, in the command below, mc cat streams your object and its output is piped to your XML parser, which reads from standard input.

    $ mc cat s3.amazonaws.com/<yourbucket>/<yourobject> | <your_xml_parser> 
    

    This way you can avoid downloading the file locally.

    Additionally, mc provides more tools for working with Amazon S3 compatible cloud storage and filesystems, with features such as resumable uploads, a progress bar, and parallel copy. mc is written in Go and released under the Apache License v2. It is supported on OS X, Linux, and Windows.
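
The `<your_xml_parser>` end of the pipeline above could be a small Ruby script that handles SAX-style events as data arrives on standard input. The following is only a minimal sketch under some assumptions: it uses REXML's stream parser (normally shipped with Ruby as a bundled gem) rather than Ox, and the `ElementCounter` class and `count_elements` method are hypothetical names chosen for illustration. Ox, mentioned in the question, offers a comparable `Ox.sax_parse(handler, io)` entry point that also reads from an IO.

```ruby
require "rexml/document"
require "rexml/streamlistener"

# Hypothetical listener: counts elements by tag name as parse events arrive,
# instead of building a tree of the whole document.
class ElementCounter
  include REXML::StreamListener

  attr_reader :counts

  def initialize
    @counts = Hash.new(0)
  end

  # Called once for every opening tag.
  def tag_start(name, _attrs)
    @counts[name] += 1
  end
end

# Parses XML incrementally from any IO-like object (for example $stdin)
# and returns a hash of tag name => number of occurrences.
def count_elements(io)
  listener = ElementCounter.new
  REXML::Document.parse_stream(io, listener)
  listener.counts
end
```

To use it in the pipeline, one option is to end the script with `p count_elements($stdin)`, save it under a name of your choice (say, `count_elements.rb`), and run `mc cat s3.amazonaws.com/<yourbucket>/<yourobject> | ruby count_elements.rb`. Because the parser only sees an IO, the same method can be exercised locally with a `StringIO` or a small test file before pointing it at the 50 GB object.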