python xml celementtree

Create multiple nodes having the same name with sub nodes


I have a text file that I parse with Python using the xml.etree.cElementTree library. The input is a paragraph <p> containing sentences <s>, and each sentence contains words <w>. Here is what the text file looks like:

This
is
my
first
sentence.
This
is
my
second
sentence.

In the output I would like to have the following xml file:

<p>
   <s>
      <w>this</w>
      <w>is</w>
      <w>my</w>
      <w>first</w>
      <w>sentence</w>
      <pc>.</pc>
   </s>
   <s>
      <w>this</w>
      <w>is</w>
      <w>my</w>
      <w>second</w>
      <w>sentence</w>
      <pc>.</pc>
   </s>
</p>

I wrote the following Python code, which gives me the paragraph tag and the word tags, but I don't know how to produce multiple <s> tags. A sentence starts with a capital letter and ends with a dot. My Python code:

import re
import xml.etree.ElementTree as ET

p = ET.Element("p")
tree = ET.ElementTree(p)

source_file = open("file.txt", "r")
for line in source_file:
    # catch punctuation: . and , and ! and ? and ()
    if re.match(r"(\(|\)|\.|\,|\!)", line):
        ET.SubElement(p, "pc").text = line
    else:
        ET.SubElement(p, "w").text = line
source_file.close()

tree.write("my_file.xml", encoding="UTF-8", xml_declaration=True)

This produces the following XML output:

<?xml version="1.0" encoding="UTF-8"?>
<p>
   <w>this</w>
   <w>is</w>
   <w>my</w>
   <w>first</w>
   <w>sentence</w>
   <pc>.</pc>
   <w>this</w>
   <w>is</w>
   <w>my</w>
   <w>second</w>
   <w>sentence</w>
   <pc>.</pc>
</p>

The problem is that I can't create a new <s> tag for every new sentence. Is there a way to do that with the XML library in Python?


Solution

  • Basically, you need some logic to identify where a new sentence starts. Something like the code below should do:

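    import re
    import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9

    p = ET.Element("p")
    s = ET.SubElement(p, "s")  # the first sentence
    eos = False  # True once the previous token was punctuation
    with open("file.txt", "r") as source_file:
        for line in source_file:
            line = line.rstrip("\n")  # remove the newline at the end of each line
            # catch punctuation: . and , and ! and ? and ()
            # re.match only tests the start of the line, so 'sentence.' does not match; verify your input
            if re.match(r"[().,!?]", line):
                ET.SubElement(s, "pc").text = line
                eos = True
            else:
                # a capitalized word right after punctuation opens a new <s>
                if eos and line.strip() and line[0].isupper():
                    s = ET.SubElement(p, "s")
                eos = False
                ET.SubElement(s, "w").text = line

    ET.ElementTree(p).write("my_file.xml", encoding="UTF-8", xml_declaration=True)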
    

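    Also, your regex might need a fix: re.match only tests the start of the line, so a token such as 'sentence.' is treated as a word rather than as a word followed by punctuation.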
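If the input really has tokens like "sentence." with the dot attached, one possible variation is to split trailing punctuation off each token first and close the sentence whenever that punctuation contains ".", "!" or "?". The sketch below is only an illustration under those assumptions: the names build_paragraph, TOKEN_RE and SENTENCE_END are made up for this example, and it relies on the end-of-sentence punctuation alone rather than on the capital-letter rule. Leading punctuation glued to a word, such as "(word", is not split.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: only these characters end a sentence.
SENTENCE_END = {".", "!", "?"}
# Splits a token such as "sentence." into ("sentence", ".").
TOKEN_RE = re.compile(r"^(.*?)([().,!?]*)$")


def build_paragraph(lines):
    """Group one-token-per-line input into <p> / <s> / <w> and <pc> elements."""
    p = ET.Element("p")
    s = None  # current <s>; created lazily so no empty <s> is emitted
    for raw in lines:
        token = raw.strip()
        if not token:
            continue  # skip blank lines
        word, trailing = TOKEN_RE.match(token).groups()
        if s is None:
            s = ET.SubElement(p, "s")
        if word:
            ET.SubElement(s, "w").text = word
        for ch in trailing:
            ET.SubElement(s, "pc").text = ch
        if any(ch in SENTENCE_END for ch in trailing):
            s = None  # the next token opens a new sentence


    return p


sample = ["This", "is", "my", "first", "sentence.",
          "This", "is", "my", "second", "sentence."]
paragraph = build_paragraph(sample)
print(ET.tostring(paragraph, encoding="unicode"))
```

To read from the file instead, pass the open file object, since build_paragraph only iterates over lines. The result can be written out with ET.ElementTree(paragraph).write(...) as in the question.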