I am trying to remove specific span tags from a CSV file, but my code is deleting all of them. I only need to remove certain ones, for example '<span style="font-family: verdana,geneva; font-size: 10pt;">'. Some cells also contain '<b>', '<p>', or '<STRONG>' tags that bold the text, like <STRONG>name</STRONG>, and I need to keep those. I want to remove only the font-family and font-size styling shown above. How can this be done with Python?
import re

CLEANR = re.compile('<.*?>')

def cleanhtml(raw_html):
    cleantext = re.sub(CLEANR, '', raw_html)
    return cleantext

a_file = open("file.csv", 'r')
lines = a_file.readlines()
a_file.close()

newfile = open("file2.csv", 'w')
for line in lines:
    line = cleanhtml(line)
    newfile.write(line)
newfile.close()
If your input is always an HTML string, you could use BeautifulSoup.
Here is an example:
from bs4 import BeautifulSoup

doc = '''<span style="font-family: verdana,geneva; font-size: 10pt;"><b>xyz</b></span>'''
soup = BeautifulSoup(doc, "html.parser")
for tag in soup.recursiveChildGenerator():
    try:
        result = dict(filter(lambda elem: 'font-family' not in elem[1] and 'font-size' not in elem[1], tag.attrs.items()))
        tag.attrs = result
    except AttributeError:
        pass
print(soup)
The output:
<span><b>xyz</b></span>
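Note that this leaves an empty <span> wrapper behind. If the span tag itself should go while its contents stay, one option is BeautifulSoup's unwrap(). The following is only a sketch: the helper name strip_font_spans and the UNWANTED tuple are made up for illustration, and it assumes a span should be dropped whenever its style attribute mentions font-family or font-size.

```python
from bs4 import BeautifulSoup

# Hypothetical names: the style keywords mirror the ones in the question.
UNWANTED = ("font-family", "font-size")

def strip_font_spans(raw_html):
    soup = BeautifulSoup(raw_html, "html.parser")
    # find_all returns a list-like result, so the tree can be edited while looping.
    for span in soup.find_all("span"):
        style = span.get("style", "")
        if any(key in style for key in UNWANTED):
            # Remove the <span> tag itself but keep its children in place.
            span.unwrap()
    return str(soup)

doc = '<span style="font-family: verdana,geneva; font-size: 10pt;"><b>xyz</b></span>'
print(strip_font_spans(doc))
```

Spans without those style keywords, and all other tags such as <b> or <p>, are left untouched.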
You can use this in your own code like this:
from bs4 import BeautifulSoup

def cleanhtml(raw_html):
    soup = BeautifulSoup(raw_html, "html.parser")
    for tag in soup.recursiveChildGenerator():
        try:
            result = dict(filter(lambda elem: 'font-family' not in elem[1] and 'font-size' not in elem[1], tag.attrs.items()))
            tag.attrs = result
        except AttributeError:
            pass
    return str(soup)  # return as HTML string
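Since the data lives in a CSV file, it may be safer to clean each cell rather than each raw line, because a quoted cell can contain commas or line breaks. Below is a rough sketch using only the standard library. The names SPAN_TAG, remove_span_tags and clean_csv are invented for this example, and the regex makes a simplifying assumption: every <span> tag can be dropped, whatever its style. Unlike '<.*?>', it does not touch <b>, <p> or <STRONG>.

```python
import csv
import io
import re

# Matches only opening and closing <span> tags (assumption: all spans may go).
SPAN_TAG = re.compile(r"</?span\b[^>]*>", re.IGNORECASE)

def remove_span_tags(cell):
    return SPAN_TAG.sub("", cell)

def clean_csv(src, dst, clean=remove_span_tags):
    """Copy CSV rows from src to dst, applying `clean` to every cell."""
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for row in reader:
        writer.writerow([clean(cell) for cell in row])

# In-memory demo; the sample row is made up for illustration.
sample = (
    'id,description\r\n'
    '1,"<span style=""font-size: 10pt;""><b>xyz</b></span>, <STRONG>name</STRONG>"\r\n'
)
out = io.StringIO()
clean_csv(io.StringIO(sample, newline=""), out)
print(out.getvalue())
```

With real files, open both with newline="" (as the csv module recommends) and pass the file objects to clean_csv. To remove only spans with particular styles, a BeautifulSoup-based function could be passed as the clean argument instead of the regex helper.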