Tags: python, python-3.x, python-re

Regular expression in Python and re.split splitting the wrong thing


I've been trying to organize a text using Python, but my attempt at using re.split is not working, even though my regular expression seems correct (I tested it in Notepad++).

I need to split my text on the regular expression (and keep the matched part), but the text is being split character by character.

texttag holds the contents of a text file that looks like this:

<word>'CHAP'</word><pos> 'ADJ'</pos>
<word>'.'</word><pos> 'PUNCT'</pos>
<word>'Ier'</word><pos> 'NOUN'</pos>
<word>'\n'</word><pos> 'SPACE'</pos>
<word>'                '</word><pos> 'SPACE'</pos>
<word>'Marseille'</word><pos> 'PROPN'</pos>
<word>'.'</word><pos> 'PUNCT'</pos>
<word>'L’'</word><pos> 'PROPN'</pos>
<word>'arrivée'</word><pos> 'NOUN'</pos>
<word>'\n'</word><pos> 'SPACE'</pos>
<word>'\n'</word><pos> 'SPACE'</pos>
<word>'    '</word><pos> 'SPACE'</pos>
<word>'Le'</word><pos> 'DET'</pos>
<word>'24'</word><pos> 'NUM'</pos>
<word>'février'</word><pos> 'NOUN'</pos>
<word>'1815'</word><pos> 'NUM'</pos>

I'm trying to split the text on this chapter heading:

<word>'CHAP'</word><pos> 'ADJ'</pos>
<word>'.'</word><pos> 'PUNCT'</pos>
<word>'Ier'</word><pos> 'NOUN'</pos>

and tag the result like this:

<chap1>
<head><word>'CHAP'</word><pos> 'ADJ'</pos>
<word>'.'</word><pos> 'PUNCT'</pos>
<word>'Ier'</word><pos> 'NOUN'</pos>
</head>
<word>'\n'</word><pos> 'SPACE'</pos>
<word>'                '</word><pos> 'SPACE'</pos>
<word>'Marseille'</word><pos> 'PROPN'</pos>
<word>'.'</word><pos> 'PUNCT'</pos>
<word>'L’'</word><pos> 'PROPN'</pos>
<word>'arrivée'</word><pos> 'NOUN'</pos>
<word>'\n'</word><pos> 'SPACE'</pos>
<word>'\n'</word><pos> 'SPACE'</pos>
<word>'    '</word><pos> 'SPACE'</pos>
<word>'Le'</word><pos> 'DET'</pos>
<word>'24'</word><pos> 'NUM'</pos>
<word>'février'</word><pos> 'NOUN'</pos>
<word>'1815'</word><pos> 'NUM'</pos>
</chap>

Here is my whole code so far:

Dumas_XML=open("D:/cours/M1/S2/PTpython/GitHub/test2.txt","a") #C:/Users/super/Desktop/PTpython/GitHub/
# write the XML header
Dumas_XML.write('<?xml version="1.0" encoding="UTF-8"?>\n')
Dumas_XML.write('<Doc name="DUMAS" path="C:/Users/super/Desktop/PTpython/GitHub/textes"></Doc>\n')
Dumas_XML.write('<Document num="1" taille= "nombre de mots int()"/> </Document> \n ')

filetag = open("D:/cours/M1/S2/PTpython/GitHub/wordtag.txt")

import re
texttag= filetag.read()

regextag ="(<word>'CHAP'</word><pos> '[A-Z]{2,5}'</pos>\r\n<word>'.'</word><pos> 'PUNCT'</pos>\r\n<word>'[A-Z]{1,7}'</word><pos> '[A-Z]{1,7}'</pos>)"

xx=re.split(regextag, texttag)

compteurchap=0
for chap in xx :
    if re.search(regextag, chap) : 
        compteurchap=compteurchap+1
        Dumas_XML.write("<chap"+str(compteurchap)+">\n")
        print("<head>"+chap+"</head>")
        Dumas_XML.write("<head>"+chap+"</head>")
    #else:
        Dumas_XML.write(chap)
        Dumas_XML.write("</chap>\n")

How can I do this correctly?


Solution
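
  • A likely reason the pattern in the question never matches, judging only from the sample shown (the real file may differ): a file read in Python's default text mode has its \r\n line endings translated to \n, so the \r\n in the pattern finds nothing; the '.' is an unescaped wildcard; and [A-Z]{1,7} cannot match the lowercase letters in 'Ier'. When nothing matches, re.split returns the whole text as a single item. The sketch below compares the original pattern with a hypothetical adjusted one (the name adjusted and its exact character classes are my assumptions):

```python
import re

# Sample in the question's format. Reading a file in text mode yields "\n"
# line endings, even when the file on disk uses "\r\n".
sample = (
    "<word>'CHAP'</word><pos> 'ADJ'</pos>\n"
    "<word>'.'</word><pos> 'PUNCT'</pos>\n"
    "<word>'Ier'</word><pos> 'NOUN'</pos>\n"
    "<word>'\\n'</word><pos> 'SPACE'</pos>\n"
)

# The pattern from the question, only broken across lines for readability.
original = (
    "(<word>'CHAP'</word><pos> '[A-Z]{2,5}'</pos>\r\n"
    "<word>'.'</word><pos> 'PUNCT'</pos>\r\n"
    "<word>'[A-Z]{1,7}'</word><pos> '[A-Z]{1,7}'</pos>)"
)

# Hypothetical adjusted pattern (an assumption, not the asker's code):
# \r?\n accepts both line endings, \. is a literal dot, and
# [A-Za-z] allows the lowercase letters in 'Ier'.
adjusted = (
    r"(<word>'CHAP'</word><pos> '[A-Z]{2,5}'</pos>\r?\n"
    r"<word>'\.'</word><pos> 'PUNCT'</pos>\r?\n"
    r"<word>'[A-Za-z]{1,7}'</word><pos> '[A-Z]{1,7}'</pos>)"
)

parts_original = re.split(original, sample)
parts_adjusted = re.split(adjusted, sample)

print(len(parts_original))  # no match: the text comes back as one piece
print(len(parts_adjusted))  # match: text before, captured heading, text after
```

    Notepad++ works on the raw file, where the \r\n is still present, and as far as I know it ignores case unless "Match case" is ticked, which may explain why the pattern looked fine there.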

  • If you must use a regex, this could be an option:

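    import re

    # Assumes texttag and Dumas_XML are defined as in the question.
    # Head: everything from the first <word> up to the first 'NOUN' tag.
    pattern1 = re.compile(r"<word>.*?'NOUN'</pos>", re.MULTILINE | re.DOTALL)
    # Body: everything after that first 'NOUN' tag, up to the end of the text.
    pattern2 = re.compile(r"'NOUN'</pos>(.*)$", re.MULTILINE | re.DOTALL)

    reobj = pattern1.search(texttag)

    text = "<chap1>\n<head>"
    text += reobj.group() + "\n</head>\n"
    # strip("\n") avoids a blank line after </head> and before </chap>
    text += pattern2.findall(texttag)[0].strip("\n")
    text += "\n</chap>\n"
    print(text)
    Dumas_XML.write(text)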
    

    output:

    <chap1>
    <head><word>'CHAP'</word><pos> 'ADJ'</pos>
    <word>'.'</word><pos> 'PUNCT'</pos>
    <word>'Ier'</word><pos> 'NOUN'</pos>
    </head>
    <word>'\n'</word><pos> 'SPACE'</pos>
    <word>'                '</word><pos> 'SPACE'</pos>
    <word>'Marseille'</word><pos> 'PROPN'</pos>
    <word>'.'</word><pos> 'PUNCT'</pos>
    <word>'L’'</word><pos> 'PROPN'</pos>
    <word>'arrivée'</word><pos> 'NOUN'</pos>
    <word>'\n'</word><pos> 'SPACE'</pos>
    <word>'\n'</word><pos> 'SPACE'</pos>
    <word>'    '</word><pos> 'SPACE'</pos>
    <word>'Le'</word><pos> 'DET'</pos>
    <word>'24'</word><pos> 'NUM'</pos>
    <word>'février'</word><pos> 'NOUN'</pos>
    <word>'1815'</word><pos> 'NUM'</pos>
    </chap>
    

    Is that close to what you are looking for?
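
  • If you would rather stay with re.split and handle several chapters, here is a minimal sketch. It relies on the documented behaviour that a capturing group makes re.split keep the matched text in the result list. The HEAD pattern, the wrap_chapters helper, and the second chapter in the sample ('II' and its tags) are made up for illustration; adapt them to the real file.

```python
import re

# Hypothetical heading pattern (an assumption): CHAP, a literal dot, then a
# chapter label such as 'Ier' or 'II', each on its own line.
HEAD = re.compile(
    r"(<word>'CHAP'</word><pos> '[A-Z]{2,5}'</pos>\r?\n"
    r"<word>'\.'</word><pos> 'PUNCT'</pos>\r?\n"
    r"<word>'[A-Za-z]{1,7}'</word><pos> '[A-Z]{1,7}'</pos>)\r?\n"
)


def wrap_chapters(text):
    """Wrap each heading and the body that follows it in chapter tags."""
    parts = HEAD.split(text)
    # parts[0] is whatever precedes the first heading; after that the list
    # alternates heading, body, heading, body, ...
    out = [parts[0]]
    pairs = zip(parts[1::2], parts[2::2])
    for number, (head, body) in enumerate(pairs, start=1):
        out.append(f"<chap{number}>\n<head>{head}\n</head>\n")
        out.append(body.rstrip("\n"))
        # Closing tag kept as </chap> to mirror the layout asked for in the question.
        out.append("\n</chap>\n")
    return "".join(out)


# Made-up two-chapter sample in the question's format.
sample = (
    "<word>'CHAP'</word><pos> 'ADJ'</pos>\n"
    "<word>'.'</word><pos> 'PUNCT'</pos>\n"
    "<word>'Ier'</word><pos> 'NOUN'</pos>\n"
    "<word>'Marseille'</word><pos> 'PROPN'</pos>\n"
    "<word>'CHAP'</word><pos> 'ADJ'</pos>\n"
    "<word>'.'</word><pos> 'PUNCT'</pos>\n"
    "<word>'II'</word><pos> 'NOUN'</pos>\n"
    "<word>'Le'</word><pos> 'DET'</pos>\n"
)

result = wrap_chapters(sample)
print(result)
```

    Note that <chap1> closed by </chap> is not well-formed XML; if the file has to be parsed later, something like <chap n="1"> ... </chap> would be safer.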