Tags: python-2.7, parallel-processing, gnu

GNU Parallel to run Python script on huge file


I have a file that contains an XML element on every line, and it needs to be converted to JSON. I have written a Python script that does the conversion, but it runs serially. My two options are Hadoop and GNU Parallel; I have tried Hadoop and now want to see how GNU Parallel could help, which will surely be simpler.

My Python code is as follows:

    import sys
    import json
    import xmltodict

    with open('/path/sample.xml') as fd:
        for line in fd:
            o = xmltodict.parse(line)
            t = json.dumps(o)
            with open('sample.json', 'a') as out:
                out.write(t + "\n")

So can I use GNU Parallel to work directly on the huge file, or do I need to split it first?

Or is this right?

    cat sample.xml | parallel python xmltojson.py > sample.json

Thanks


Solution

  • You need to change your Python code to a UNIX filter, i.e. a program that reads from standard input (stdin) and writes to standard output (stdout). Untested:

    import fileinput
    import json
    import xmltodict

    # Read lines from stdin (or from files named as arguments),
    # convert each XML line to JSON, one object per output line.
    for line in fileinput.input():
        o = xmltodict.parse(line)
        t = json.dumps(o)
        print t  # 'print' already appends the newline


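    If you want to sanity-check the filter first, you can run it serially through a pipe; assuming the filter above is saved as my_script.py (the name the parallel command below uses), this should produce the same output as the original script:

    cat sample.xml | python my_script.py > sample.json
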
    Then you use --pipepart in GNU Parallel:

    parallel --pipepart -a sample.xml --block -1 python my_script.py
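
    A note on the flags, as I understand them: --pipepart needs a real (seekable) file given with -a, and is faster than --pipe because each job can read its chunk of the file directly, while --block -1 tells GNU Parallel to split the file into roughly one block per jobslot. To collect the converted records into a file, and to keep them in the same order as the input lines, a sketch like this should work (adding -k/--keep-order and the redirection are my assumptions, not part of the original answer):

    parallel -k --pipepart -a sample.xml --block -1 python my_script.py > sample.json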