Removing HTML formatting from column in dataframe


I have a dataframe where I need to remove the HTML tags and convert the data to just plain text.

I have found the following (Python code to remove HTML tags from a string):

import re
CLEANR = re.compile('<.*?>')
def cleanhtml(raw_html):
    cleantext = re.sub(CLEANR, '', raw_html)
    return cleantext

I'm applying it to my column using:

df['col'] = df['col'].apply(cleanhtml(df['col']))

This caused an error as the 'col' was of the datatype Object, so I amended the function to convert the passed argument to a string, as follows:

import re
CLEANR = re.compile('<.*?>')
def cleanhtml(raw_html):
    cleantext = re.sub(CLEANR, '', str(raw_html))
    return cleantext

The code still fails, as it's receiving an object rather than a string. The error is:

Name: col, Length: 1021, dtype: object' is not a valid function for 'Series' object

Can anyone nudge me in the right direction please? Thanks.


Solution

  • import re
    import pandas as pd
    
    raw_html = """<div>
    <h1>Title</h1>
    <p>A long text........ </p>
    <a href=""> a link </a>
    </div>"""
    
    # Match HTML comments and tags
    tag_re = re.compile(r'(<!--.*?-->|<[^>]*>)')
    
    def clean_html(rawhtml):
        # str() lets non-string cells pass through the regex
        return tag_re.sub('', str(rawhtml))
    
    df = pd.DataFrame({"col": [raw_html, raw_html]})
    # Apply the function to each cell of the column
    df["col"] = df["col"].apply(clean_html)
    print(df["col"])
    

    Output:

    0    \nTitle\nA long text........ \n a link \n
    1    \nTitle\nA long text........ \n a link \n
    Name: col, dtype: object
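
Why the original call failed: in `df['col'].apply(cleanhtml(df['col']))`, `cleanhtml` runs once on the whole Series before `apply` is called. `apply` therefore receives a string (the text form of the Series) instead of a function, which is where the "is not a valid function" message comes from. Pass the function itself and pandas will call it once per cell. A minimal sketch, assuming a small made-up dataframe (the sample rows and the `cleaned` / `cleaned_str` column names are only for illustration):

```python
import re
import pandas as pd

CLEANR = re.compile(r'<.*?>')

def cleanhtml(raw_html):
    # str() guards against non-string cells (numbers, NaN, ...)
    return CLEANR.sub('', str(raw_html))

# Hypothetical sample data, not the asker's real column
df = pd.DataFrame({"col": ["<p>Hello <b>world</b></p>", "<a href=''>a link</a>"]})

# Pass the function object (no parentheses); apply calls it per cell
df["cleaned"] = df["col"].apply(cleanhtml)

# Alternative: let pandas run the regex through the .str accessor
df["cleaned_str"] = df["col"].astype(str).str.replace(r'<.*?>', '', regex=True)

print(df[["cleaned", "cleaned_str"]])
```

Both columns should hold the same tag-free text. Note that `str()` turns a missing value into the literal text `nan`, so missing cells may need handling first (for example with `fillna('')`).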