Search code examples
pythonpandasdataframeduplicates

Remove duplicate rows in DataFrame based on one column containing a substring


I have a dataframe like the following:

ID, Components
1,  "Room 1, ABC"
2,  "Room 2, ABC"
3,  "Room 3, DEF"
4,  "Room 1, DEF"
5,  "Room 3, DEF" 

I need to filter the dataframe so that there is only one row per room and the first occurrence of a given room is kept:

ID, Components
1,  "Room 1, ABC"
2,  "Room 2, ABC"
3,  "Room 3, DEF"

As shown above, we can see the rows with ID 4&5 have been removed as "Room 1" and "Room 3" is used in row's with IDs 1 and 3.

Alternatively, a count of unique rooms would also work, however the remainder of the string for Components can repeat, e.g. there can be numerous ABCs and DEFs but only 1 Room 1/2/3...

Therefore counting unique entries in the Components column will not work. It must be unique only for "Room n".


Solution

  • Input data:

    df = pd.DataFrame({'ID': np.arange(1, 6), 'Components': ['Room 1, ABC', 'Room 2, ABC', 'Room 3, DEF', 'Room 1, DEF', 'Room 3, DEF']})
    

    Use str.extract:

    df.loc[~df['Components'].str.extract('(\d+)').duplicated()]
    

    Output:

       ID   Components
    0   1  Room 1, ABC
    1   2  Room 2, ABC
    2   3  Room 3, DEF