Search code examples
pythonpython-re

Erases the following tag while trying to replace a html tag using Python


I have a word doc of following format

test1.docx

 ["<h58>This is article|", "", ", "<s1>Author is|", "<s33>Research is on|", "<h4>CASE IS|", "<s6>1-3|"]

Tried to locate tag starting with <s.*?> and replace the tag and its contents with ""

def locatestag():
 fileis = open("test1.docx")

 for line in fileis:
   print(line)
   newfile = re.sub('<s.*?> .*? ','',line)

with open("new file.json","w") as newoutput:
   son.dump(newfile, newoutput)

The final output file also makes tag like disapper.

The final contents is like

["<h58>This is article|", "", ", ]

How to remove <s.*> and its contents only while retaining rest of the tag (ie retain tag )


Solution

  • Adjust your regex so that only the <s.*> tags and their contents are matched.

    for line in fileis:
        print(line)
        newfile = re.sub('<s[^>]*>[^"]*(?=")','',line)
    

    The resulting newfile will be:

    ["<h58>This is article|", "", ", "", "", "<h4>CASE IS|", ""]
    

    Of course this assumes that the "unclosed" double quote in what looks like an array of strings is not a typo but intentionally the content of your file.