Tags: python, html, html-parsing, inline-styles

Removing Specific Span Tags from a CSV file


I am trying to remove specific span tags from a CSV file, but my code is deleting all tags. I only need to remove certain ones, for example '<span style="font-family: verdana,geneva; font-size: 10pt;">'. Some cells also contain '<b>', '<p>', or '<STRONG>' tags, such as <STRONG>name</STRONG> to bold the text, and I need to keep those. I want to remove only the font-family and font-size styling shown above. How can this be done with Python?

import re

CLEANR = re.compile('<.*?>')


def cleanhtml(raw_html):
    cleantext = re.sub(CLEANR, '', raw_html)
    return cleantext


a_file = open("file.csv", 'r')

lines = a_file.readlines()
a_file.close()

newfile = open("file2.csv", 'w')
for line in lines:
    line = cleanhtml(line)
    newfile.write(line)
newfile.close()

Solution

  • If your input is always an HTML string, you could use BeautifulSoup to strip the unwanted style attributes.

    Here is an example:

    from bs4 import BeautifulSoup

    doc = '''<span style="font-family: verdana,geneva; font-size: 10pt;"><b>xyz</b></span>'''
    soup = BeautifulSoup(doc, "html.parser")
    # find_all(True) returns only tags, so text nodes (which have no attrs) are skipped
    for tag in soup.find_all(True):
        # drop every attribute whose value mentions font-family or font-size
        tag.attrs = {
            name: value
            for name, value in tag.attrs.items()
            if 'font-family' not in value and 'font-size' not in value
        }
    print(soup)
    

    The output:

    <span><b>xyz</b></span>
    

    You can then use this in your code as a replacement for cleanhtml:

    from bs4 import BeautifulSoup

    def cleanhtml(raw_html):
        soup = BeautifulSoup(raw_html, "html.parser")
        # find_all(True) returns only tags, so text nodes (which have no attrs) are skipped
        for tag in soup.find_all(True):
            # drop every attribute whose value mentions font-family or font-size
            tag.attrs = {
                name: value
                for name, value in tag.attrs.items()
                if 'font-family' not in value and 'font-size' not in value
            }
        return str(soup)  # return as an HTML string
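
The function above strips the style attribute but leaves an empty <span> behind. If the goal is to drop the span tag itself while keeping tags such as <b>, <p>, and <STRONG>, one possible variation is sketched below. The names strip_font_spans, clean_csv, and UNWANTED_STYLE_KEYS are hypothetical, and the sketch assumes the CSV is parsed with the standard csv module so that each cell is cleaned on its own rather than line by line.

```python
import csv

from bs4 import BeautifulSoup

# Style properties from the question; adjust as needed (assumption).
UNWANTED_STYLE_KEYS = ("font-family", "font-size")


def strip_font_spans(raw_html):
    """Remove <span> tags whose style sets font-family or font-size.

    The span's children (text, <b>, <p>, <strong>, ...) are kept in place.
    """
    soup = BeautifulSoup(raw_html, "html.parser")
    for span in soup.find_all("span"):
        style = span.get("style", "")
        if any(key in style for key in UNWANTED_STYLE_KEYS):
            span.unwrap()  # removes the tag itself, keeps its contents
    return str(soup)


def clean_csv(src_path, dst_path):
    """Clean every cell of src_path and write the result to dst_path."""
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            writer.writerow([strip_font_spans(cell) for cell in row])
```

Calling clean_csv("file.csv", "file2.csv") would mirror the file names in the question. Note that html.parser lowercases tag names, so <STRONG> comes back as <strong>, and plain-text cells containing a bare & may come back with it escaped as &amp;.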