I have a text file that I parse with Python and convert to XML using the xml.etree.cElementTree library.
The text is one paragraph <p> made of sentences <s>, and each sentence is made of words <w>. The text file has one token per line and looks like this:
This
is
my
first
sentence.
This
is
my
second
sentence.
In the output I would like to have the following XML file:
<p>
<s>
<w>this</w>
<w>is</w>
<w>my</w>
<w>first</w>
<w>sentence</w>
<pc>.</pc>
</s>
<s>
<w>this</w>
<w>is</w>
<w>my</w>
<w>second</w>
<w>sentence</w>
<pc>.</pc>
</s>
</p>
I wrote the following Python code, which gives me the paragraph tag and the word tags, but I don't know how to handle the case of multiple <s> tags. A sentence starts with a capital letter and ends with a dot.
My Python code:
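import re
import xml.etree.ElementTree as ET  # cElementTree is deprecated and was removed in Python 3.9

p = ET.Element("p")
tree = ET.ElementTree(p)
source_file = open("file.txt", "r")
for line in source_file:
    # catch punctuation: . and , and ! and ? and ()
    if re.match(r"(\(|\)|\.|\,|\!)", str(line)):
        ET.SubElement(p, "pc").text = line
    else:
        ET.SubElement(p, "w").text = line
tree.write("my_file.xml", encoding="UTF-8", xml_declaration=True)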
This gives the following XML output:
<?xml version="1.0" encoding="UTF-8"?>
<p>
<w>this</w>
<w>is</w>
<w>my</w>
<w>first</w>
<w>sentence</w>
<pc>.</pc>
<w>this</w>
<w>is</w>
<w>my</w>
<w>second</w>
<w>sentence</w>
<pc>.</pc>
</p>
The problem is that I can't create a new <s> tag for every new sentence. Is there a way to do that with the XML library in Python?
Basically, you need some logic to identify the start of a new sentence. Leaving out the obvious parts (the imports and the creation of p and source_file), something like the code below should do:
eos = False  # True once a punctuation line has been seen
s = ET.SubElement(p, 's')  # first sentence
for line in source_file:
    line = line.rstrip("\r\n")  # remove the newline at the end of each line
    # catch punctuation: . and , and ! and ? and ()
    if re.match(r"(\(|\)|\.|\,|\!)", line):  # don't think this matches 'sentence.', you will need to verify
        ET.SubElement(s, "pc").text = line
        eos = True
    else:
        # a capitalized word after punctuation starts a new sentence
        if eos and line.strip() and line[0].isupper():
            s = ET.SubElement(p, 's')
            eos = False
        ET.SubElement(s, "w").text = line
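Also, your regex might need a fix: re.match only matches at the start of the string, so a line like 'sentence.' goes to the <w> branch with the dot still attached, and the pattern does not include ? even though the comment mentions it.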
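To make the regex point concrete, here is a minimal, self-contained sketch of one way to split the punctuation off each token and group the result into <s> elements. It is not the only way to do this. The names build_paragraph, TOKEN_RE and SENTENCE_END are made up for illustration. The sketch assumes that a sentence ends at a token whose last character is ".", "!" or "?" (it does not check for a capital letter), and that words should be lowercased as in the desired output.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical pattern: optional leading punctuation, the word, optional trailing punctuation.
TOKEN_RE = re.compile(r"^([().,!?]*)(.*?)([().,!?]*)$")
# Assumption: these characters close a sentence.
SENTENCE_END = {".", "!", "?"}


def build_paragraph(lines):
    """Build a <p> element with one <s> per sentence from one-token-per-line input."""
    p = ET.Element("p")
    s = None  # current <s>; created lazily when the next token arrives
    for raw in lines:
        token = raw.strip()
        if not token:
            continue  # skip blank lines
        if s is None:
            s = ET.SubElement(p, "s")
        lead, word, trail = TOKEN_RE.match(token).groups()
        for ch in lead:
            ET.SubElement(s, "pc").text = ch
        if word:
            # Lowercasing is an assumption taken from the desired output in the question.
            ET.SubElement(s, "w").text = word.lower()
        for ch in trail:
            ET.SubElement(s, "pc").text = ch
        if token[-1] in SENTENCE_END:
            s = None  # the next token opens a new <s>
    return p


sample = ["This\n", "is\n", "my\n", "first\n", "sentence.\n",
          "This\n", "is\n", "my\n", "second\n", "sentence.\n"]
paragraph = build_paragraph(sample)
print(ET.tostring(paragraph, encoding="unicode"))
```

To write the result to a file as in the question, wrap it in a tree: ET.ElementTree(paragraph).write("my_file.xml", encoding="UTF-8", xml_declaration=True). On Python 3.9 or later, calling ET.indent(paragraph) first should give indented output. Abbreviations such as "e.g." would be treated as sentence ends by this sketch, so real text may need a stricter rule.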