python pandas dataframe group-by

Merge rows that are mostly duplicate except for a column's binary value


I am cleaning a database of movies. It was formed by merging 4 CSVs (4 streaming services' movies) into one. There are some movies that exist on two or more streaming services, like both Prime and Hulu.

I was able to merge the rest of the columns with:

    movies.groupby(compareColumns, group_keys=False)[allColumns].apply(lambda x: x.ffill().bfill())

But now I'm left with rows that are practically identical except for their onPrime/onNetflix value (0=not available on service, 1=available on service).

For example, two rows I have are:

name     onPrime  onHulu  otherColumn
Movie 1  1        0       X
Movie 1  0        1       X

How do I systematically merge the two rows to get the desired output below? (I have other columns that I don't want to be affected.)

Desired output:

name     onPrime  onHulu  otherColumn
Movie 1  1        1       X

Not sure how I could do it through sum, bfill, ffill, or any built-in function.

I tried filledgroups.fillna(value=0, axis=0, inplace=True, limit=1), where filledgroups is just a dataframe of two of the rows for trial, but it filled in 0s for other columns, whereas I only want to replace the 0s of onPrime/onHulu with 1s.


Solution

  • Grouping by name should do the trick:

    df_grouped = df.groupby('name').max().reset_index()
    

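    This groups the rows by name and takes the max() of every other column within each group. Because the onPrime/onHulu flags are 0 or 1, the maximum is 1 whenever any of the duplicate rows has a 1. Note that max() is also applied to the remaining columns, which is harmless when the duplicate rows agree on those values.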

    If you want to apply different aggregations to the other columns, you can use agg():

    df_grouped = df.groupby('name').agg({'onPrime': 'max', 'onHulu': 'max', 'otherColumn': 'first'}).reset_index()
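With many columns, writing the agg() dictionary by hand gets tedious. Here is a minimal sketch that builds it programmatically, using the sample rows from the question plus a made-up "Movie 2" row. The names flag_columns, other_columns and aggregations are illustrative, and it assumes "name" alone identifies a movie and that every non-flag column should keep its first value.

```python
import pandas as pd

# Sample data modelled on the question; "Movie 2" is an extra illustrative row.
df = pd.DataFrame({
    "name": ["Movie 1", "Movie 1", "Movie 2"],
    "onPrime": [1, 0, 0],
    "onHulu": [0, 1, 1],
    "otherColumn": ["X", "X", "Y"],
})

# Assumption: these are the 0/1 availability flags to combine.
flag_columns = ["onPrime", "onHulu"]

# Every column that is neither a flag nor the grouping key.
other_columns = [c for c in df.columns if c not in flag_columns and c != "name"]

# max for the flags (1 if any duplicate row has 1), first for everything else.
aggregations = {col: "max" for col in flag_columns}
aggregations.update({col: "first" for col in other_columns})

df_grouped = df.groupby("name", as_index=False).agg(aggregations)
print(df_grouped)
```

Passing as_index=False keeps name as a regular column, so no reset_index() is needed. If several columns are required to identify a movie (for example name and year), pass a list of those columns to groupby and exclude them from other_columns.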