Search code examples
pythonstringtextreplacesplit

String cleaning removing consecutive value and put comma in the end


I have this string from an email I'm scraping:

TICKET\xa0\xa0 STATE\xa0\xa0\xa0\xa0 ACCOUNT IDENTIFIER\xa0\xa0\xa0 FILE DIRECTORY\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0 CODE

My objective are the following:

  1. Remove \xa0
  2. Create comma separation for each group string

This is my ideal result:

TICKET,STATE,ACCOUNT IDENTIFIER,FILE DIRECTORY

On the other hand, here's what I ended up getting:

#code
my_string.replace(' ', ',').replace('\xa0', '')

#result
TICKET,STATE,ACCOUNT,IDENTIFIER,FILE,DIRECTORY

I was thinking of using regex however, I have no idea how I can implement the logic.


Solution

  • The relevant string separating the items you care about is \xa0, so you can split on that first and then just keep the elements which contain something other than just whitespace:

    my_string = "TICKET\xa0\xa0 STATE\xa0\xa0\xa0\xa0 ACCOUNT IDENTIFIER\xa0\xa0\xa0 FILE DIRECTORY\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0 CODE"
    
    print(", ".join(x.strip() for x in my_string.split("\xa0") if x.strip()))
    # Output: TICKET, STATE, ACCOUNT IDENTIFIER, FILE DIRECTORY, CODE