I have a dataframe like the following:
ID, Components
1, "Room 1, ABC"
2, "Room 2, ABC"
3, "Room 3, DEF"
4, "Room 1, DEF"
5, "Room 3, DEF"
I need to filter the dataframe so that there is only one row per room and the first occurrence of a given room is kept:
ID, Components
1, "Room 1, ABC"
2, "Room 2, ABC"
3, "Room 3, DEF"
As shown above, we can see the rows with ID 4&5 have been removed as "Room 1" and "Room 3" is used in row's with IDs 1 and 3.
Alternatively, a count of unique rooms would also work, however the remainder of the string for Components can repeat, e.g. there can be numerous ABCs and DEFs but only 1 Room 1/2/3...
Therefore counting unique entries in the Components column will not work. It must be unique only for "Room n".
Input data:
df = pd.DataFrame({'ID': np.arange(1, 6), 'Components': ['Room 1, ABC', 'Room 2, ABC', 'Room 3, DEF', 'Room 1, DEF', 'Room 3, DEF']})
Use str.extract
:
df.loc[~df['Components'].str.extract('(\d+)').duplicated()]
Output:
ID Components
0 1 Room 1, ABC
1 2 Room 2, ABC
2 3 Room 3, DEF