I have a DataFrame column that contains a lot of HTML tags, and I want to strip them out.
df['text_col'].tolist()
['n</li></ul>',
'<p> bla bla bla </p>',
'bla bla </b>, </li>, <li>, </ul>',
'bla bla <strong>bla </strong>: <h3> </h3>, <ul>,<b> </p>']
I see two ways of cleaning it.
I don't know of any way other than str.replace, but it doesn't quite do what I described above.
df["text_col"].str.replace("</p>"," ")
How do I remove all the tags and clean the text_col?
After a little looking around, this is what I found:
import re
x=['n</li></ul>',
'<p> bla bla bla </p>',
'bla bla </b>, </li>, <li>, </ul>',
'bla bla <strong>bla </strong>: <h3> </h3>, <ul>,<b> </p>']
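for item in x:
    # Remove anything that looks like a tag, plus commas and colons
    item = re.sub("<.*?>|,|:", "", item)
    # Collapse runs of whitespace into single spaces
    item = ' '.join(item.split())
    print(item)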
Outputs:
n
bla bla bla
bla bla
bla bla bla
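I edited my answer again to refine it a little more. This should answer your question: the regex removes the tags along with the stray commas and colons, and the split/join collapses the leftover whitespace. Thank regex :)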
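Since the question is about a DataFrame column rather than a plain list, here is a minimal sketch that wraps the same regex in a reusable function. The names `clean_text` and `TAG_OR_PUNCT` are mine, made up for illustration, and the sample strings are the ones from the question.

```python
import re

# Same pattern as above: any "<...>" run (non-greedy), plus commas and colons.
# TAG_OR_PUNCT and clean_text are hypothetical names chosen for this sketch.
TAG_OR_PUNCT = re.compile(r"<.*?>|,|:")


def clean_text(text):
    """Strip tag-like markup, commas and colons, then collapse whitespace."""
    stripped = TAG_OR_PUNCT.sub("", text)
    # split() with no arguments splits on any whitespace run and drops empties
    return " ".join(stripped.split())


samples = [
    "n</li></ul>",
    "<p> bla bla bla </p>",
    "bla bla </b>, </li>, <li>, </ul>",
    "bla bla <strong>bla </strong>: <h3> </h3>, <ul>,<b> </p>",
]
cleaned = [clean_text(s) for s in samples]
print(cleaned)
```

To apply it to the column, something like `df["text_col"] = df["text_col"].map(clean_text)` should work, assuming every value in the column is a string (a NaN would make `re` raise a TypeError). Keep in mind that the pattern removes every comma and colon, including the ones in ordinary text, so drop the `|,|:` part if you only want the tags gone.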