Tags: python-2.7, parallel-processing, gnu

GNU Parallel to run Python script on huge file


I have a file that contains an XML element on every line, and it needs to be converted to JSON. I have written a Python script that does the conversion, but it runs serially. My two options are Hadoop and GNU Parallel; I have tried Hadoop and now want to see how GNU Parallel could help, which will surely be simpler.

My Python code is as follows:

    import sys
    import json
    import xmltodict

    with open('/path/sample.xml') as fd:
        for line in fd:
            o = xmltodict.parse(line)
            t = json.dumps(o)
            with open('sample.json', 'a') as out:
                out.write(t + "\n")

So can I use GNU Parallel to work directly on the huge file, or do I need to split it first?

Or is this right?

    cat sample.xml | parallel python xmltojson.py > sample.json

Thanks


Solution

  • You need to change your Python code to a UNIX filter, i.e. a program that reads from standard input (stdin) and writes to standard output (stdout). Untested:

    import fileinput
    import json
    import xmltodict

    # Read lines from stdin (or from files named as arguments),
    # convert each XML line to JSON, one object per output line.
    for line in fileinput.input():
        o = xmltodict.parse(line)
        t = json.dumps(o)
        print t  # 'print' already appends the newline


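    If you want to sanity-check the filter first, you can run it serially through a pipe; assuming the filter above is saved as my_script.py (the name the parallel command below uses), this should produce the same output as the original script:

    cat sample.xml | python my_script.py > sample.json
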
    Then you use --pipepart in GNU Parallel:

    parallel --pipepart -a sample.xml --block -1 python my_script.py
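
    A note on the flags, as I understand them: --pipepart needs a real (seekable) file given with -a, and is faster than --pipe because each job can read its chunk of the file directly, while --block -1 tells GNU Parallel to split the file into roughly one block per jobslot. To collect the converted records into a file, and to keep them in the same order as the input lines, a sketch like this should work (adding -k/--keep-order and the redirection are my assumptions, not part of the original answer):

    parallel -k --pipepart -a sample.xml --block -1 python my_script.py > sample.json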