I know how to convert a single XML file or link to JSON in Python using xmltodict. I was wondering, however, whether there is an efficient way to convert multiple XML files (on the order of hundreds or even thousands) to JSON in Python? Or, instead of Python, is there another tool better suited to this? Please note that I am not a very skilled programmer and have only used Python sporadically.
It depends on the specific case you are working on.
My example case (for background):
For instance, I once had to read data from a large set (a 1-million-word subcorpus, around 2.6 GB) consisting of 3890 directories, each containing an ann_morphosyntax.xml file.
A snippet from one of the ann_morphosyntax.xml files, for reference:
<?xml version="1.0" encoding="UTF-8"?>
<teiCorpus xmlns="http://www.tei-c.org/ns/1.0" xmlns:nkjp="http://www.nkjp.pl/ns/1.0" xmlns:xi="http://www.w3.org/2001/XInclude">
 <xi:include href="NKJP_1M_header.xml"/>
 <TEI>
  <xi:include href="header.xml"/>
  <text>
   <body>
    <p corresp="ann_segmentation.xml#segm_1-p" xml:id="morph_1-p">
     <s corresp="ann_segmentation.xml#segm_1.5-s" xml:id="morph_1.5-s">
      <seg corresp="ann_segmentation.xml#segm_1.1-seg" xml:id="morph_1.1-seg">
       <fs type="morph">
        <f name="orth">
         <string>Jest</string>
        </f>
Each of those ann_morphosyntax.xml files contained one or more objects (let's call them paragraphs for simplicity), each of which I needed to convert to JSON. Such a paragraph object starts with <p in the XML snippet above.
Additionally, I needed to keep all those JSONs in one file and make that file as small as possible, so I decided to use the JSONL (JSON Lines) format. This format stores every JSON object as a single line of the file, without any extra whitespace, which eventually let me shrink the initial data set to around 450 MB.
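To make the JSONL idea concrete, here is a minimal standard-library sketch (the jsonlines package used further below wraps this same pattern; the file name is just a placeholder):

```python
import json

records = [
    {"word": {"id": "1", "text": "This"}},
    {"word": {"id": "2", "text": "is"}},
]

# JSON Lines: one compact JSON object per line, no extra whitespace
with open("sample.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, separators=(",", ":")) + "\n")

# Reading it back is just line-by-line json.loads
with open("sample.jsonl", encoding="utf-8") as f:
    parsed = [json.loads(line) for line in f]
```

The compact separators are what keep the file size down compared with pretty-printed JSON.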
I implemented the solution below in Python 3.6.
Solution:
To try this solution yourself, first run the following script, which creates two small example XML files in an output directory:
import os
import xml.etree.ElementTree as ET


def prettify(element, indent='  '):
    """Indent an ElementTree element in place, breadth-first."""
    queue = [(0, element)]  # (level, element)
    while queue:
        level, element = queue.pop(0)
        children = [(level + 1, child) for child in list(element)]
        if children:
            element.text = '\n' + indent * (level + 1)  # for child open
        if queue:
            element.tail = '\n' + indent * queue[0][0]  # for sibling open
        else:
            element.tail = '\n' + indent * (level - 1)  # for parent close
        queue[0:0] = children  # prepend so children come before siblings


def _create_word_object(sentence_object, number, word_string):
    word = ET.SubElement(sentence_object, 'word', number=str(number))
    string = ET.SubElement(word, 'string', number=str(number))
    string.text = word_string


def create_two_xml_files():
    xml_doc_1 = ET.Element('paragraph', number='1')
    xml_doc_2 = ET.Element('paragraph', number='1')
    sentence_1 = ET.SubElement(xml_doc_1, 'sentence', number='1')
    sentence_2 = ET.SubElement(xml_doc_2, 'sentence', number='1')
    _create_word_object(sentence_1, 1, 'This')
    _create_word_object(sentence_2, 1, 'This')
    _create_word_object(sentence_1, 2, 'is')
    _create_word_object(sentence_2, 2, 'is')
    _create_word_object(sentence_1, 3, 'first')
    _create_word_object(sentence_2, 3, 'second')
    _create_word_object(sentence_1, 4, 'example')
    _create_word_object(sentence_2, 4, 'example')
    _create_word_object(sentence_1, 5, 'sentence')
    _create_word_object(sentence_2, 5, 'sentence')
    _create_word_object(sentence_1, 6, '.')
    _create_word_object(sentence_2, 6, '.')
    prettify(xml_doc_1)
    prettify(xml_doc_2)
    tree_1 = ET.ElementTree(xml_doc_1)
    tree_2 = ET.ElementTree(xml_doc_2)
    os.makedirs('output', exist_ok=True)  # os.mkdir would fail on a second run
    tree_1.write('output/example_1.xml', encoding='UTF-8', xml_declaration=True)
    tree_2.write('output/example_2.xml', encoding='UTF-8', xml_declaration=True)


def main():
    create_two_xml_files()


if __name__ == '__main__':
    main()
Then run this second script from the same directory; it parses every XML file found in the output directory and appends one JSON object per word to output.jsonl (it uses the third-party jsonlines package):
import os
import glob
import errno
import jsonlines
import xml.etree.ElementTree as ET


class Word:
    def __init__(self, word_id, word_text):
        self.word_id = word_id
        self.word_text = word_text

    def create_word_dict(self):
        return {"word": {"id": self.word_id, "text": self.word_text}}


def parse_xml(file_path):
    # iterparse streams the file, so even very large XML files stay memory-friendly
    for event, element in ET.iterparse(file_path, events=("end",)):
        if element.tag == 'word':
            yield Word(element[0].get('number'), element[0].text)
            element.clear()  # free the element once it has been processed


def write_dicts_from_xmls_in_directory_to_jsonlines_file(parsing_generator):
    path = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'output', '*')
    xml_files = glob.iglob(path)
    with jsonlines.open('output.jsonl', mode='a') as writer:
        for xml_file_name in xml_files:
            try:
                with open(xml_file_name):
                    for next_word in parsing_generator(xml_file_name):
                        writer.write(next_word.create_word_dict())
            except IOError as error:
                if error.errno != errno.EISDIR:  # skip directories, re-raise anything else
                    raise


def main():
    write_dicts_from_xmls_in_directory_to_jsonlines_file(parse_xml)


if __name__ == '__main__':
    main()
The output.jsonl file will contain, on each line, one JSON object representing a word element from the example_1.xml and example_2.xml files generated in the first step.
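Given the structure returned by create_word_dict, the first few lines of output.jsonl should look roughly like this (the order of the two files depends on how glob lists them):

```json
{"word": {"id": "1", "text": "This"}}
{"word": {"id": "2", "text": "is"}}
{"word": {"id": "3", "text": "first"}}
```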
You can elaborate on that example and adapt it to your own needs.
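If each of your files fits in memory, the same many-files-to-one-JSONL idea can also be sketched with the standard library alone, converting each XML document into a nested dict (the glob pattern, file names, and the element_to_dict helper are just illustrative placeholders, not part of the solution above):

```python
import glob
import json
import xml.etree.ElementTree as ET


def element_to_dict(element):
    """Recursively convert an Element into a plain dict."""
    node = dict(element.attrib)           # start from the attributes
    children = list(element)
    if children:
        node["children"] = [{child.tag: element_to_dict(child)} for child in children]
    elif element.text and element.text.strip():
        node["text"] = element.text.strip()
    return node


# One JSON object per XML file, one line each
with open("documents.jsonl", "w", encoding="utf-8") as out:
    for path in sorted(glob.iglob("output/*.xml")):
        root = ET.parse(path).getroot()
        out.write(json.dumps({root.tag: element_to_dict(root)}) + "\n")
```

This trades the streaming behaviour of iterparse for simplicity, so it is only a reasonable choice when individual files are small.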
P.S.
The first script is based on the post Pretty printing XML in Python.