I have the below DF I need to manipulate. The problem is the two tags columns can have either none, a Movie title or a Genre. I need to separate the Movie title and Genres into two new separate columns named Movie and Genre respectively and if either column contains none don't even evaluate it.
VideoId | Tags1 | Tags2 |
---|---|---|
1234 | None | None |
3456 | Movie:Rocky | Genre:Drama |
5678 | Genre:Romance | Movie:Love Me |
7890 | Movie:Scream | Genre:Horror |
9012 | Movie:Shrek | None |
So if columns contains Movie: then split the and put the second item of the array into the new Movie Column. If it contains Genre then split and place it in the new Genre column. I am getting an index out of range for the 'none' columns. And I think I have to reiterate thru each column again to find Genres that are in the Movies DF and vice-versa.
Movies = df['Tags1'].apply(lambda x: x.split(':')[1] if 'Movie' in x else x)
Genres = df['Tags2'].apply(lambda x: x.split(':')[1] if 'Genre' in x else x)
The resulting DF should look like below:
VideoId | Movie | Genre |
---|---|---|
1234 | None | None |
3456 | Rocky | Drama |
5678 | Love Me | Romance |
7890 | Scream | Horror |
9012 | Shrek | None |
The suggested solution below uses .itertuples
to go through all cells and splits any cells [unless they're in the first columns] containing :
into a key-value pair, so it should work for any number of tag-types and Tags_
columns.
# import pandas as pd
# df = pd.read_html('https://stackoverflow.com/questions/75909759')[0] # from your q
parse_cols = [c for c in df.columns if c[:4]=='Tags' or c=='VideoId'] # [filter]
max_cols = 2 # None # [ set None to include all columns ]
new_df = pd.DataFrame([{'VideoId':vi, **{k:v for k,v in [
rv.split(':',1) for rv in row[:max_cols] if isinstance(rv,str) and ':' in rv
]}} for _, vi, *row in df[parse_cols].itertuples()])
# ]}} for vi, *row in df.itertuples()]) ## if df['VideoId'] is also df.index
NOTE: Omit the _,
part in ...for _, vi, *row in...
if df['VideoId']
is also df.index
.
new_df
should look like
╒═══════════╤═════════╤═════════╕ │ VideoId │ Movie │ Genre │ ╞═══════════╪═════════╪═════════╡ │ 1234 │ None │ None │ ├───────────┼─────────┼─────────┤ │ 3456 │ Rocky │ Drama │ ├───────────┼─────────┼─────────┤ │ 5678 │ Love Me │ Romance │ ├───────────┼─────────┼─────────┤ │ 7890 │ Scream │ Horror │ ├───────────┼─────────┼─────────┤ │ 9012 │ Shrek │ None │ ╘═══════════╧═════════╧═════════╛
[Printed with
print(new_df.fillna('None').to_markdown(index=False, tablefmt='fancy_grid'))
.]
Btw, about
"What happens if both Tags1, Tags2 column has Movie: or Genre: in the same row?" -- comment by SomeDude
In the case of this solution, the value from last column [Tags2
in your sample data] will over-write any values with the same key from previous columns.