I have a file that contains one XML element per line, which needs to be converted to JSON. I have written a Python script that does the conversion, but it runs in serial mode. I have two options: Hadoop or GNU Parallel. I have already tried Hadoop and now want to see how GNU Parallel could help, since it should be simpler.
My Python code is as follows:
import sys
import json
import xmltodict

with open('/path/sample.xml') as fd:
    for line in fd:
        o = xmltodict.parse(line)
        t = json.dumps(o)
        with open('sample.json', 'a') as out:
            out.write(t + "\n")
Can I use GNU Parallel to work directly on the huge file, or do I need to split it first?
Or is this right:
cat sample.xml | parallel python xmltojson.py >sample.json
Thanks
You need to change your Python code to a UNIX filter, i.e. a program that reads from standard input (stdin) and writes to standard output (stdout). Untested:
import fileinput
import json
import xmltodict

# Read lines from stdin (or from files named on the command line),
# convert each XML record to JSON, and write it to stdout.
for line in fileinput.input():
    o = xmltodict.parse(line)
    t = json.dumps(o)
    print(t)
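You can test the filter on its own before involving GNU Parallel. Assuming you save it as my_script.py, this should give the same result as your serial version:

cat sample.xml | python my_script.py > sample.json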
Then you use --pipepart in GNU Parallel:
parallel --pipepart -a sample.xml --block -1 python my_script.py
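A couple of notes on that command: --pipepart reads blocks directly from the file given with -a (so the whole file does not have to be piped through GNU Parallel), and --block -1 should split the file into roughly one block per jobslot. If the order of the output lines matters, adding -k (--keep-order) preserves the input order, and you redirect stdout to collect the result, e.g.:

parallel -k --pipepart -a sample.xml --block -1 python my_script.py > sample.json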