python string pandas split data-cleaning

Splitting DataFrame column from a referenced list of values


I have a pandas DataFrame (a list of video games) with a column "classification". Each value in that column is either:

  • a single classification: "RPG" or "Action"
  • multiple classifications: "Action Adventure RPG Roguelike", "Action Shoot'em Up Wargame"

Notice that there is no separator between the values.

I need to split this into a new column WITH a separator (or into some other structure that holds each element separately).

For example:

"Action Adventure RPG Roguelike" => "Action, Adventure, RPG, Roguelike"
"Action Shoot'em Up Wargame" => "Action, Shoot'em Up, Wargame"

I can't split on spaces or on capital letters ("Shoot'em Up" is ONE value).

So, as I see it, I need to write a function to apply to this column that checks each string against a hand-made list of values, finds every occurrence, and returns the string with separators.

Something like this:

classification = ["Action", "Adventure", "RPG", "Roguelike", "Shoot'em Up", "Wargame", ...]

def magic_trick(data):
    # do the magic: compare data against each possible classification
    return data_separated

But I don't know how to do it, ideally in the most efficient way.

Can someone help me? Thanks in advance.


Solution

  • Here's an idea: use str.findall with a pattern built from the list of known values.

                                    0
    0  Action Adventure RPG Roguelike
    1      Action Shoot'em Up Wargame
    
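    import pandas as pd

    # df holds the raw strings in column 0, as printed above
    sep = ["Action", "Adventure", "RPG", "Roguelike", "Shoot'em Up", "Wargame"]
    # Alternation pattern: matches any one of the known values
    pattern = '|'.join(sep)

    # findall returns a list of matches per row; expand the lists into columns
    pd.DataFrame(df[0].str.findall(pattern).tolist())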
    

            0            1        2          3
    0  Action    Adventure      RPG  Roguelike
    1  Action  Shoot'em Up  Wargame       None
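
If a single separator-joined column is wanted, as the question asks, rather than one column per value, the list that findall returns for each row can be joined back into a string. The following is only a minimal sketch, assuming the column is named classification as in the question; the names known, classification_list and classification_sep are made up for illustration. It also escapes each value with re.escape and tries longer values first, which only matters if a known value contains regex metacharacters or is a prefix of another known value.

```python
import re
import pandas as pd

# Sample data and the hand-made list of known values from the question
df = pd.DataFrame({"classification": [
    "Action Adventure RPG Roguelike",
    "Action Shoot'em Up Wargame",
]})
known = ["Action", "Adventure", "RPG", "Roguelike", "Shoot'em Up", "Wargame"]

# Longest values first, so that a value which is a prefix of another
# (hypothetically "Action" vs "Action RPG") does not win the alternation
ordered = sorted(known, key=len, reverse=True)
# re.escape makes any regex metacharacters in a value match literally
pattern = "|".join(re.escape(value) for value in ordered)

# findall yields one list of matches per row
matches = df["classification"].str.findall(pattern)
df["classification_list"] = matches             # list structure
df["classification_sep"] = matches.str.join(", ")  # separator-joined string

print(df["classification_sep"].tolist())
```

Text that matches none of the known values is silently dropped by findall, so it may be worth checking rows whose joined result is shorter than expected against the hand-made list.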