Removing HTML formatting from column in dataframe

I have a dataframe where I need to remove the HTML tags and convert the data to just plain text.

I have found the following (Python code to remove HTML tags from a string):

import re
CLEANR = re.compile('<.*?>')
def cleanhtml(raw_html):
    cleantext = re.sub(CLEANR, '', raw_html)
    return cleantext

I'm applying it to my column using:

df['col'] = df['col'].apply(cleanhtml(df['col']))

This raised an error, which I assumed was because 'col' has the dtype object, so I amended the function to convert the passed argument to a string, as follows:

import re
CLEANR = re.compile('<.*?>')
def cleanhtml(raw_html):
    cleantext = re.sub(CLEANR, '', str(raw_html))
    return cleantext

The code still fails as it's receiving an object not string. The error is:

Name: col, Length: 1021, dtype: object' is not a valid function for 'Series' object

Can anyone nudge me in the right direction please? Thanks.

Solution

import re
import pandas as pd

raw_html = """<div>
<h1>Title</h1>
<p>A long text........ </p>
<a href=""> a link </a>
</div>"""

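# Match HTML comments or any tag.
tag_re = re.compile(r'(<!--.*?-->|<[^>]*>)')
# str() converts non-string values before the substitution.
clean_html = lambda rawhtml: tag_re.sub('', str(rawhtml))
df = pd.DataFrame({"col":[raw_html, raw_html]})
# Clean each cell individually, then write the results back to the column.
html_to_text = [clean_html(h) for h in df.col]

df.col = html_to_text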
print(df.col)

Output:

0    \nTitle\nA long text........ \n a link \n
1    \nTitle\nA long text........ \n a link \n
Name: col, dtype: object
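
Why the original call failed: `df['col'].apply(cleanhtml(df['col']))` calls `cleanhtml` on the whole Series first and passes the result to `apply`. With the `str()` conversion in place, that result is the printed representation of the Series (ending in `Name: col, Length: 1021, dtype: object`), which pandas then appears to treat as a function name, hence the error. Passing the function itself lets `apply` call it once per cell. A minimal sketch of that fix, where the sample data is hypothetical and not taken from the question:

```python
import re
import pandas as pd

CLEANR = re.compile(r'<.*?>')

def cleanhtml(raw_html):
    # str() converts non-string cells before the substitution.
    return CLEANR.sub('', str(raw_html))

# Hypothetical sample data, not from the original question.
df = pd.DataFrame({"col": ["<p>Hello <b>world</b></p>", "<a href=''>a link</a>"]})

# Pass the function object (no parentheses); apply calls it once per cell.
df["col"] = df["col"].apply(cleanhtml)
print(df["col"].tolist())
```

This is equivalent to the list comprehension above. Both rely on a regular expression, which is adequate for simple markup but may mishandle unusual input such as a `>` inside an attribute value.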