python string dataframe data-cleaning

Replace anything between < and > in a df column with a space


I have a df column that contains a lot of HTML tags. I want to clean them out.

df['text_col'].tolist()


['n</li></ul>',
 '<p> bla bla bla </p>',
 'bla bla </b>, </li>, <li>, </ul>',
 'bla bla <strong>bla </strong>: <h3> </h3>, <ul>,<b> </p>']

I see two ways of cleaning it.

  1. Create a list of all the tags I find in the text and replace each one with the empty string '' (the list would be laborious to maintain).
  2. Use some logic that removes anything between < and >.

I don't know of any way other than str.replace, but it doesn't do what I described above: it only replaces the one tag I pass in.

df["text_col"].str.replace("</p>"," ")

How do I remove all the tags and clean the text_col?


Solution

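  • After a little looking around, this is what I found. The non-greedy pattern <.*?> matches each tag, and the , and : alternatives also strip the leftover commas and colons: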

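    import re

    x = ['n</li></ul>',
         '<p> bla bla bla </p>',
         'bla bla </b>, </li>, <li>, </ul>',
         'bla bla <strong>bla </strong>: <h3> </h3>, <ul>,<b> </p>']

    for item in x:
        # Replace every tag, comma and colon with a space, so that words
        # separated only by a tag do not run together.
        item = re.sub(r"<.*?>|,|:", " ", item)
        # Collapse runs of whitespace and trim both ends.
        item = ' '.join(item.split())
        print(item)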
    

    Outputs:

    n
    bla bla bla
    bla bla
    bla bla bla
    

    I edited my answer again to refine it a little more; this should answer your question. Thank regex :)
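
The loop above works on a plain list, while the question asks about a df column. Below is a minimal sketch of the same idea as a reusable function. The names `strip_tags` and `TAG_RE` are made up for illustration, and the sketch assumes every value is a string (no NaN). Unlike the pattern above, it removes only the tags and leaves commas and colons in place.

```python
import re

# Hypothetical helper names; not part of pandas or the original answer.
# Matches '<', then any run of characters that are not '>', then '>'.
TAG_RE = re.compile(r"<[^>]*>")


def strip_tags(text):
    """Replace each <...> span with a space, then collapse whitespace."""
    no_tags = TAG_RE.sub(" ", text)
    return " ".join(no_tags.split())


rows = ['n</li></ul>',
        '<p> bla bla bla </p>',
        'bla bla </b>, </li>, <li>, </ul>',
        'bla bla <strong>bla </strong>: <h3> </h3>, <ul>,<b> </p>']

cleaned = [strip_tags(row) for row in rows]
for row in cleaned:
    print(row)
```

On the DataFrame from the question, the function could be applied with `df["text_col"] = df["text_col"].apply(strip_tags)`. If the punctuation should go as well, the character class from the answer can be added back to the pattern.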