I have a Word document in the following format:
test1.docx
["<h58>This is article|", "", ", "<s1>Author is|", "<s33>Research is on|", "<h4>CASE IS|", "<s6>1-3|"]
I tried to locate every tag matching <s.*?> and replace the tag and its contents with "":
import json
import re

def locatestag():
    fileis = open("test1.docx")
    for line in fileis:
        print(line)
        # This pattern is the one that removes too much
        newfile = re.sub('<s.*?> .*? ','',line)
        with open("new file.json","w") as newoutput:
            json.dump(newfile, newoutput)
The output file also loses other tags that should have been kept.
The final contents look like this:
["<h58>This is article|", "", ", ]
How can I remove only the <s.*> tags and their contents while retaining the other tags?
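Adjust your regex so that only the <s.*> tags and their contents are matched. Using [^>]* keeps the match inside the tag itself, and [^"]*(?=") stops just before the next double quote, so the surrounding quotes and the other tags are left alone: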
for line in fileis:
    print(line)
    newfile = re.sub('<s[^>]*>[^"]*(?=")','',line)
The resulting newfile will be:
["<h58>This is article|", "", ", "", "", "<h4>CASE IS|", ""]
Of course, this assumes that the "unclosed" double quote in what looks like an array of strings is not a typo but the intended content of your file.
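To check the pattern without touching any files, here is a minimal, self-contained sketch that applies it to the sample line from the question. The names S_TAG, strip_s_tags, sample and cleaned are only illustrative and not part of the original code.

```python
import re

# Matches an <s...> tag plus everything up to (but not including)
# the next double quote, so the surrounding quotes are kept.
S_TAG = re.compile(r'<s[^>]*>[^"]*(?=")')

def strip_s_tags(line: str) -> str:
    """Remove every <s...> tag and its contents from one line."""
    return S_TAG.sub('', line)

# Sample line copied from the question, including the unclosed quote.
sample = '["<h58>This is article|", "", ", "<s1>Author is|", "<s33>Research is on|", "<h4>CASE IS|", "<s6>1-3|"]'

cleaned = strip_s_tags(sample)
print(cleaned)
# prints: ["<h58>This is article|", "", ", "", "", "<h4>CASE IS|", ""]
```

Note that <s[^>]*> matches any tag whose name starts with "s". If only numbered tags such as <s1> or <s33> should be removed, a tighter pattern like <s\d+> may be safer, but that is an assumption about your data.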