Converting multiple xml files/links to JSON using Python?


I know how to convert a single XML file or link to JSON in Python using xmltodict. However, I was wondering whether there is an efficient way to convert multiple XML files (on the order of hundreds or even thousands) to JSON in Python. Or is there another tool better suited to this than Python? Please note that I am not a very skilled programmer and have only used Python sporadically.


Solution

  • It depends on the specific case you are working on.

    My example case (for background):

    For instance, I once had to read data from a large data set (a 1-million-word subcorpus, around 2.6 GB) consisting of 3890 directories, each containing an ann_morphosyntax.xml file.

    A snippet from one of the ann_morphosyntax.xml files, for reference:

    <?xml version="1.0" encoding="UTF-8"?>
    <teiCorpus xmlns="http://www.tei-c.org/ns/1.0" xmlns:nkjp="http://www.nkjp.pl/ns/1.0" xmlns:xi="http://www.w3.org/2001/XInclude">
     <xi:include href="NKJP_1M_header.xml"/>
     <TEI>
      <xi:include href="header.xml"/>
      <text>
       <body>
        <p corresp="ann_segmentation.xml#segm_1-p" xml:id="morph_1-p">
         <s corresp="ann_segmentation.xml#segm_1.5-s" xml:id="morph_1.5-s">
          <seg corresp="ann_segmentation.xml#segm_1.1-seg" xml:id="morph_1.1-seg">
           <fs type="morph">
            <f name="orth">
             <string>Jest</string>
            </f>
    
    

    Each of those ann_morphosyntax.xml files contained one or more objects (let's say paragraphs, for simplicity), each of which I needed to convert to JSON. Such a paragraph object starts with <p in the XML snippet above.

    I also needed to keep those JSON objects in a single file and make that file as small as possible, so I decided to use the JSONL format. This format stores each JSON object on its own line, and each line can be written without any extra whitespace, which eventually let me reduce the size of the initial data set to around 450 MB.
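
    As a rough illustration of the JSONL idea described above, here is a minimal sketch that uses only the standard json module. The records list and the helper names to_jsonl_line and write_jsonl are hypothetical and are not part of the original solution; the compact output comes from the separators argument of json.dumps.

```python
import json

# Hypothetical records standing in for converted paragraphs.
records = [
    {"id": "p1", "words": ["This", "is"]},
    {"id": "p2", "words": ["example"]},
]


def to_jsonl_line(record):
    """Serialize one record as a single compact line (no spaces after separators)."""
    return json.dumps(record, ensure_ascii=False, separators=(",", ":"))


def write_jsonl(path, items):
    """Write each item on its own line and return the number of lines written."""
    count = 0
    with open(path, "w", encoding="utf-8") as handle:
        for item in items:
            handle.write(to_jsonl_line(item) + "\n")
            count += 1
    return count
```

    Calling write_jsonl("output.jsonl", records) would produce one line per record. Spaces that are part of the data itself are of course kept; only the separators are compacted.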

    I implemented the solution in Python 3.6. Here is what I did:

    1. I used iglob to iterate through those directories and pick up the ann_morphosyntax.xml file from each of them.
    2. To parse each ann_morphosyntax.xml file, I used the ElementTree library.
    3. I saved the resulting JSON objects in an output.jsonl file.
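
    The three steps above could be sketched roughly as follows for files shaped like the ann_morphosyntax.xml snippet. This is only an assumption-laden outline, not the code I actually ran on the corpus: the function names iter_paragraphs and convert_corpus, the one-level directory layout, and the output shape (an "id" plus a list of "strings") are all made up for illustration. The TEI namespace is taken from the snippet, and xml:id is looked up under the standard XML namespace, which is how ElementTree exposes it.

```python
import glob
import json
import os
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"  # default namespace from the snippet
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"  # expanded name of xml:id


def iter_paragraphs(file_path):
    """Yield one dict per <p> element, clearing each element after use."""
    for _event, element in ET.iterparse(file_path, events=("end",)):
        if element.tag == TEI_NS + "p":
            # Hypothetical output shape: keep the id and every <string> text.
            strings = [s.text for s in element.iter(TEI_NS + "string")]
            yield {"id": element.get(XML_ID), "strings": strings}
            element.clear()  # free the memory held by the finished paragraph


def convert_corpus(root_dir, output_path, file_name="ann_morphosyntax.xml"):
    """Convert root_dir/*/file_name into one JSON line per paragraph."""
    pattern = os.path.join(root_dir, "*", file_name)
    written = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for xml_path in sorted(glob.iglob(pattern)):
            for paragraph in iter_paragraphs(xml_path):
                line = json.dumps(paragraph, ensure_ascii=False, separators=(",", ":"))
                out.write(line + "\n")
                written += 1
    return written
```

    Because iterparse is combined with element.clear(), only one paragraph at a time needs to stay in memory, which matters when there are thousands of files.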

    Solution:

    To try this solution yourself, do the following:

    1. Run this script to create two files, example_1.xml and example_2.xml, in the output directory under the root directory of your project:
    import os
    import xml.etree.ElementTree as ET
    
    
    def prettify(element, indent='  '):
       queue = [(0, element)]  # (level, element)
       while queue:
           level, element = queue.pop(0)
           children = [(level + 1, child) for child in list(element)]
           if children:
               element.text = '\n' + indent * (level+1)  # for child open
           if queue:
               element.tail = '\n' + indent * queue[0][0]  # for sibling open
           else:
               element.tail = '\n' + indent * (level-1)  # for parent close
           queue[0:0] = children  # prepend so children come before siblings
    
    
    def _create_word_object(sentence_object, number, word_string):
       word = ET.SubElement(sentence_object, 'word', number=str(number))
       string = ET.SubElement(word, 'string', number=str(number))
       string.text = word_string
    
    
    def create_two_xml_files():
       xml_doc_1 = ET.Element('paragraph', number='1')
       xml_doc_2 = ET.Element('paragraph', number='1')
       sentence_1 = ET.SubElement(xml_doc_1, 'sentence', number='1')
       sentence_2 = ET.SubElement(xml_doc_2, 'sentence', number='1')
       _create_word_object(sentence_1, 1, 'This')
       _create_word_object(sentence_2, 1, 'This')
       _create_word_object(sentence_1, 2, 'is')
       _create_word_object(sentence_2, 2, 'is')
       _create_word_object(sentence_1, 3, 'first')
       _create_word_object(sentence_2, 3, 'second')
       _create_word_object(sentence_1, 4, 'example')
       _create_word_object(sentence_2, 4, 'example')
       _create_word_object(sentence_1, 5, 'sentence')
       _create_word_object(sentence_2, 5, 'sentence')
       _create_word_object(sentence_1, 6, '.')
       _create_word_object(sentence_2, 6, '.')
       prettify(xml_doc_1)
       prettify(xml_doc_2)
       tree_1 = ET.ElementTree(xml_doc_1)
       tree_2 = ET.ElementTree(xml_doc_2)
       # makedirs with exist_ok=True does not fail if the directory already exists
       os.makedirs('output', exist_ok=True)
       tree_1.write('output/example_1.xml', encoding='UTF-8', xml_declaration=True)
       tree_2.write('output/example_2.xml', encoding='UTF-8', xml_declaration=True)
    
    
    def main():
       create_two_xml_files()
    
    
    if __name__ == '__main__':
       main()
    
    
    2. Then run this script, which iterates over the example_1.xml and example_2.xml files (using iglob) and creates an output.jsonl file (saved in the root directory of your project) with the data from the two XML files created in the first step. It requires the third-party jsonlines package:
    import os
    import glob
    import errno
    import jsonlines
    import xml.etree.ElementTree as ET
    
    
    class Word:
        def __init__(self, word_id, word_text):
            self.word_id = word_id
            self.word_text = word_text
    
        def create_word_dict(self):
            return {"word": {"id": self.word_id, "text": self.word_text}}
    
    
    def parse_xml(file_path):
        for event, element in ET.iterparse(file_path, events=("start", "end",)):
            if event == "end":
                if element.tag == 'word':
                    yield Word(element[0].get('number'), element[0].text)
                    element.clear()
    
    
    def write_dicts_from_xmls_in_directory_to_jsonlines_file(parsing_generator):
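        # Match everything inside the output directory next to this script.
        script_dir = os.path.dirname(os.path.abspath(__file__))
        path = os.path.join(script_dir, 'output', '*')
        xml_files = glob.iglob(path)
        # mode='a' appends, so delete output.jsonl before re-running to avoid duplicates.
        with jsonlines.open('output.jsonl', mode='a') as writer:
            for xml_file_name in xml_files:
                try:
                    # Opening the path first makes directories fail with EISDIR.
                    with open(xml_file_name):
                        for next_word in parsing_generator(xml_file_name):
                            writer.write(next_word.create_word_dict())
                except IOError as exc:
                    # Skip directories; re-raise any other I/O error.
                    if exc.errno != errno.EISDIR:
                        raise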
    
    
    def main():
        write_dicts_from_xmls_in_directory_to_jsonlines_file(parse_xml)
    
    
    if __name__ == '__main__':
        main()
    
    

    Each line of the output.jsonl file will contain a JSON object representing one word element from the example_1.xml and example_2.xml files generated in the first step.
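
    To check the result, the file can be read back line by line. The sketch below uses only the standard json module and assumes the {"word": {"id": ..., "text": ...}} shape produced by the second script; the helper names read_jsonl and collect_word_texts are hypothetical.

```python
import json


def read_jsonl(path):
    """Yield one parsed object per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def collect_word_texts(path):
    """Return the "text" of every {"word": {...}} record, in file order."""
    return [record["word"]["text"] for record in read_jsonl(path)]
```

    For example, collect_word_texts('output.jsonl') would return the word strings in the order they were written, which in turn depends on the order in which iglob returned the files.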

    You can build on this example and adapt it to your needs.

    P.S.

    The first script is based on the post "Pretty printing XML in Python".