I have a DataFrame column that contains HTML, and I need to strip the tags so that only plain text remains.
I found the following Python code for removing HTML tags from a string:
import re
CLEANR = re.compile('<.*?>')
def cleanhtml(raw_html):
    cleantext = re.sub(CLEANR, '', raw_html)
    return cleantext
I'm applying it to my column using:
df['col'] = df['col'].apply(cleanhtml(df['col']))
This raised an error because 'col' has dtype object, so I changed the function to convert its argument to a string:
import re
CLEANR = re.compile('<.*?>')
def cleanhtml(raw_html):
    cleantext = re.sub(CLEANR, '', str(raw_html))
    return cleantext
The code still fails because it is receiving an object, not a string. The error is:
Name: col, Length: 1021, dtype: object' is not a valid function for 'Series' object
Can anyone nudge me in the right direction please? Thanks.
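import re
import pandas as pd

# Sample HTML used to build a small test DataFrame.
raw_html = """<div>
<h1>Title</h1>
<p>A long text........ </p>
<a href=""> a link </a>
</div>"""

# Matches HTML comments (<!-- ... -->) as well as ordinary tags.
tag_re = re.compile(r'(<!--.*?-->|<[^>]*>)')

def clean_html(raw_html):
    # str() guards against non-string cells in an object column.
    return tag_re.sub('', str(raw_html))

df = pd.DataFrame({"col": [raw_html, raw_html]})
# Clean each cell individually, then write the results back to the column.
html_to_text = [clean_html(h) for h in df.col]
df.col = html_to_text
print(df.col)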
Output:
0 \nTitle\nA long text........ \n a link \n
1 \nTitle\nA long text........ \n a link \n
Name: col, dtype: object
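As for why the original call fails: df['col'].apply(cleanhtml(df['col'])) calls cleanhtml on the whole Series first and then hands the result to apply, which expects a function. With the str() version, that result is the string representation of the Series, which is why the error message quotes text ending in "dtype: object". Below is a minimal sketch of two alternatives, assuming the column holds HTML strings: pass the function itself to apply, or use the vectorized Series.str.replace with regex=True. The sample data and the names TAG_RE, applied, and vectorized are made up for illustration.

```python
import re
import pandas as pd

# Hypothetical sample data, not taken from the question.
df = pd.DataFrame({"col": ["<p>Hello <b>world</b></p>", "<a href=''>a link</a>"]})

TAG_RE = re.compile(r"<[^>]*>")

def clean_html(raw_html):
    # Receives one cell at a time when used with Series.apply.
    return TAG_RE.sub("", str(raw_html))

# Option 1: pass the function object itself (no parentheses, no argument).
applied = df["col"].apply(clean_html)

# Option 2: vectorized string method with the same pattern.
vectorized = df["col"].str.replace(r"<[^>]*>", "", regex=True)

print(applied.tolist())
print(vectorized.tolist())
```

One difference worth knowing: str() turns a missing value into the text 'nan', while .str.replace leaves missing values as they are, so the second option may be preferable if the column has gaps.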