Search code examples
pythonpandasdataframesplit

Split Pandas DF columns based on condition


I have the below DF I need to manipulate. The problem is the two tags columns can have either none, a Movie title or a Genre. I need to separate the Movie title and Genres into two new separate columns named Movie and Genre respectively and if either column contains none don't even evaluate it.

VideoId Tags1 Tags2
1234 None None
3456 Movie:Rocky Genre:Drama
5678 Genre:Romance Movie:Love Me
7890 Movie:Scream Genre:Horror
9012 Movie:Shrek None

So if columns contains Movie: then split the and put the second item of the array into the new Movie Column. If it contains Genre then split and place it in the new Genre column. I am getting an index out of range for the 'none' columns. And I think I have to reiterate thru each column again to find Genres that are in the Movies DF and vice-versa.

Movies = df['Tags1'].apply(lambda x: x.split(':')[1] if 'Movie' in x else x)

Genres = df['Tags2'].apply(lambda x: x.split(':')[1] if 'Genre' in x else x)

The resulting DF should look like below:

VideoId Movie Genre
1234 None None
3456 Rocky Drama
5678 Love Me Romance
7890 Scream Horror
9012 Shrek None

Solution

  • The suggested solution below uses .itertuples to go through all cells and splits any cells [unless they're in the first columns] containing : into a key-value pair, so it should work for any number of tag-types and Tags_ columns.

    # import pandas as pd
    # df = pd.read_html('https://stackoverflow.com/questions/75909759')[0] # from your q
    
    parse_cols = [c for c in df.columns if c[:4]=='Tags' or c=='VideoId'] # [filter]
    max_cols = 2 # None # [ set None to include all columns ]
    
    new_df = pd.DataFrame([{'VideoId':vi, **{k:v for k,v in [
        rv.split(':',1) for rv in row[:max_cols] if isinstance(rv,str) and ':' in rv
    ]}} for _, vi, *row in df[parse_cols].itertuples()])
    # ]}} for vi, *row in df.itertuples()]) ## if df['VideoId'] is also df.index
    

    NOTE: Omit the _, part in ...for _, vi, *row in... if df['VideoId'] is also df.index.

    new_df should look like

    ╒═══════════╤═════════╤═════════╕
    │   VideoId │ Movie   │ Genre   │
    ╞═══════════╪═════════╪═════════╡
    │      1234 │ None    │ None    │
    ├───────────┼─────────┼─────────┤
    │      3456 │ Rocky   │ Drama   │
    ├───────────┼─────────┼─────────┤
    │      5678 │ Love Me │ Romance │
    ├───────────┼─────────┼─────────┤
    │      7890 │ Scream  │ Horror  │
    ├───────────┼─────────┼─────────┤
    │      9012 │ Shrek   │ None    │
    ╘═══════════╧═════════╧═════════╛
    

    [Printed with print(new_df.fillna('None').to_markdown(index=False, tablefmt='fancy_grid')).]


    Btw, about

    "What happens if both Tags1, Tags2 column has Movie: or Genre: in the same row?" -- comment by SomeDude

    In the case of this solution, the value from last column [Tags2 in your sample data] will over-write any values with the same key from previous columns.