I've been trying to organize a text using Python, but my attempt at using re.split
is not working, even though my regular expression seems correct (I've tested it in Notepad++).
I need to split my text on the regular expression (and keep the matched text), but the text is being split character by character.
texttag holds the contents of a txt file that looks like this:
<word>'CHAP'</word><pos> 'ADJ'</pos>
<word>'.'</word><pos> 'PUNCT'</pos>
<word>'Ier'</word><pos> 'NOUN'</pos>
<word>'\n'</word><pos> 'SPACE'</pos>
<word>' '</word><pos> 'SPACE'</pos>
<word>'Marseille'</word><pos> 'PROPN'</pos>
<word>'.'</word><pos> 'PUNCT'</pos>
<word>'L’'</word><pos> 'PROPN'</pos>
<word>'arrivée'</word><pos> 'NOUN'</pos>
<word>'\n'</word><pos> 'SPACE'</pos>
<word>'\n'</word><pos> 'SPACE'</pos>
<word>' '</word><pos> 'SPACE'</pos>
<word>'Le'</word><pos> 'DET'</pos>
<word>'24'</word><pos> 'NUM'</pos>
<word>'février'</word><pos> 'NOUN'</pos>
<word>'1815'</word><pos> 'NUM'</pos>
I'm trying to split on this chapter heading:
<word>'CHAP'</word><pos> 'ADJ'</pos>
<word>'.'</word><pos> 'PUNCT'</pos>
<word>'Ier'</word><pos> 'NOUN'</pos>
and tag the result like this:
<chap1>
<head><word>'CHAP'</word><pos> 'ADJ'</pos>
<word>'.'</word><pos> 'PUNCT'</pos>
<word>'Ier'</word><pos> 'NOUN'</pos>
</head>
<word>'\n'</word><pos> 'SPACE'</pos>
<word>' '</word><pos> 'SPACE'</pos>
<word>'Marseille'</word><pos> 'PROPN'</pos>
<word>'.'</word><pos> 'PUNCT'</pos>
<word>'L’'</word><pos> 'PROPN'</pos>
<word>'arrivée'</word><pos> 'NOUN'</pos>
<word>'\n'</word><pos> 'SPACE'</pos>
<word>'\n'</word><pos> 'SPACE'</pos>
<word>' '</word><pos> 'SPACE'</pos>
<word>'Le'</word><pos> 'DET'</pos>
<word>'24'</word><pos> 'NUM'</pos>
<word>'février'</word><pos> 'NOUN'</pos>
<word>'1815'</word><pos> 'NUM'</pos>
</chap>
Here is my whole code so far:
Dumas_XML=open("D:/cours/M1/S2/PTpython/GitHub/test2.txt","a") #C:/Users/super/Desktop/PTpython/GitHub/
# opening of the XML header
Dumas_XML.write('<?xml version="1.0" encoding="UTF-8"?>\n')
Dumas_XML.write('<Doc name="DUMAS" path="C:/Users/super/Desktop/PTpython/GitHub/textes"></Doc>\n')
Dumas_XML.write('<Document num="1" taille= "nombre de mots int()"/> </Document> \n ')
filetag = open("D:/cours/M1/S2/PTpython/GitHub/wordtag.txt")
import re
texttag= filetag.read()
regextag ="(<word>'CHAP'</word><pos> '[A-Z]{2,5}'</pos>\r\n<word>'.'</word><pos> 'PUNCT'</pos>\r\n<word>'[A-Z]{1,7}'</word><pos> '[A-Z]{1,7}'</pos>)"
xx=re.split(regextag, texttag)
compteurchap=0
for chap in xx :
    if re.search(regextag, chap) :
        compteurchap=compteurchap+1
        Dumas_XML.write("<chap"+str(compteurchap)+">\n")
        print("<head>"+chap+"</head>")
        Dumas_XML.write("<head>"+chap+"</head>")
    #else:
    Dumas_XML.write(chap)
Dumas_XML.write("</chap>\n")
How can I do this correctly?
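Before changing the approach, it may help to check whether the pattern matches at all in Python. The sketch below is only a guess at the cause, based on the sample shown in the question: a file opened in text mode is read with "\n" line endings (Python translates "\r\n" on read by default), and 'Ier' contains lowercase letters, which [A-Z] will not match unless the search is case-insensitive, as Notepad++ searches are when "Match case" is unchecked. The adjusted pattern and the sample string are hypothetical, not taken from the real file.

```python
import re

# Sample mirroring the heading lines from the question, with "\n" line endings
# as Python delivers them when a file is read in text mode.
sample = (
    "<word>'CHAP'</word><pos> 'ADJ'</pos>\n"
    "<word>'.'</word><pos> 'PUNCT'</pos>\n"
    "<word>'Ier'</word><pos> 'NOUN'</pos>\n"
)

# The pattern from the question, unchanged.
original = (
    "(<word>'CHAP'</word><pos> '[A-Z]{2,5}'</pos>\r\n"
    "<word>'.'</word><pos> 'PUNCT'</pos>\r\n"
    "<word>'[A-Z]{1,7}'</word><pos> '[A-Z]{1,7}'</pos>)"
)

# Hypothetical adjusted pattern: optional "\r", escaped dot,
# and lowercase letters allowed in the chapter numeral.
adjusted = (
    r"(<word>'CHAP'</word><pos> '[A-Z]{2,5}'</pos>\r?\n"
    r"<word>'\.'</word><pos> 'PUNCT'</pos>\r?\n"
    r"<word>'[A-Za-z]{1,7}'</word><pos> '[A-Z]{1,7}'</pos>)"
)

original_matches = re.search(original, sample) is not None
adjusted_matches = re.search(adjusted, sample) is not None
print(original_matches, adjusted_matches)
```

If the original pattern does not match, re.split returns the whole text as a single list item, so no heading is ever found in the loop.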
If you must use regex then this could be an option:
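import re

# Header: from the first <word> up to the first 'NOUN' tag (non-greedy match).
pattern1 = re.compile(r"<word>.*?'NOUN'</pos>", re.MULTILINE | re.DOTALL)
# Body: everything after that first 'NOUN' tag, up to the end of the text.
pattern2 = re.compile(r"'NOUN'</pos>(.*)$", re.MULTILINE | re.DOTALL)

# texttag and Dumas_XML are the variables defined in the question's code.
reobj = pattern1.search(texttag)
text = "<chap1>\n<head>"
text += reobj.group() + "\n</head>\n"
# strip() removes the line breaks surrounding the body to avoid blank lines.
text += pattern2.findall(texttag)[0].strip()
text += "\n</chap>\n"
print(text)
Dumas_XML.write(text)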
Output:
<chap1>
<head><word>'CHAP'</word><pos> 'ADJ'</pos>
<word>'.'</word><pos> 'PUNCT'</pos>
<word>'Ier'</word><pos> 'NOUN'</pos>
</head>
<word>'\n'</word><pos> 'SPACE'</pos>
<word>' '</word><pos> 'SPACE'</pos>
<word>'Marseille'</word><pos> 'PROPN'</pos>
<word>'.'</word><pos> 'PUNCT'</pos>
<word>'L’'</word><pos> 'PROPN'</pos>
<word>'arrivée'</word><pos> 'NOUN'</pos>
<word>'\n'</word><pos> 'SPACE'</pos>
<word>'\n'</word><pos> 'SPACE'</pos>
<word>' '</word><pos> 'SPACE'</pos>
<word>'Le'</word><pos> 'DET'</pos>
<word>'24'</word><pos> 'NUM'</pos>
<word>'février'</word><pos> 'NOUN'</pos>
<word>'1815'</word><pos> 'NUM'</pos>
</chap>
Is that close to what you are looking for?
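If the file contains several chapters, re.split with a capturing group could also work, since the captured heading is kept in the result list. This is only a minimal sketch under some assumptions: the heading pattern is a guess adapted from the question (optional "\r", escaped dot, lowercase letters allowed in the numeral), tag_chapters is a hypothetical helper name, and the second chapter in the sample is made up for illustration.

```python
import re

# Hypothetical heading pattern, adapted from the one in the question.
HEADING = re.compile(
    r"(<word>'CHAP'</word><pos> '[A-Z]{2,5}'</pos>\r?\n"
    r"<word>'\.'</word><pos> 'PUNCT'</pos>\r?\n"
    r"<word>'[A-Za-z]{1,7}'</word><pos> '[A-Z]{1,7}'</pos>)"
)


def tag_chapters(texttag):
    """Wrap each heading and the text that follows it in <chapN>/<head> tags."""
    # With one capturing group, re.split returns
    # [before, heading1, body1, heading2, body2, ...].
    parts = HEADING.split(texttag)
    out = [parts[0]]  # anything before the first heading is kept as-is
    for number, i in enumerate(range(1, len(parts), 2), start=1):
        heading, body = parts[i], parts[i + 1]
        out.append(f"<chap{number}>\n<head>{heading}\n</head>\n")
        out.append(body.strip("\r\n") + "\n")
        out.append("</chap>\n")
    return "".join(out)


# Made-up sample with two chapters, in the format shown in the question.
sample = (
    "<word>'CHAP'</word><pos> 'ADJ'</pos>\n"
    "<word>'.'</word><pos> 'PUNCT'</pos>\n"
    "<word>'Ier'</word><pos> 'NOUN'</pos>\n"
    "<word>'Marseille'</word><pos> 'PROPN'</pos>\n"
    "<word>'CHAP'</word><pos> 'ADJ'</pos>\n"
    "<word>'.'</word><pos> 'PUNCT'</pos>\n"
    "<word>'II'</word><pos> 'NOUN'</pos>\n"
    "<word>'Le'</word><pos> 'DET'</pos>\n"
)

result = tag_chapters(sample)
print(result)
```

The closing tag is left as </chap> to match the output requested in the question; a strict XML parser would expect it to match the opening tag, for example </chap1>.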