I have identical samples with different labels, which occurred due to mislabeled data. Mislabeled data can confuse the model and lower its performance.
It's a binary classification problem. My input table is something like below.
I want the following table as my cleaned data.
I tried this data-cleaning library to check for conflicting labels, but it was not able to clean the data: https://docs.deepchecks.com/stable/checks_gallery/tabular/data_integrity/plot_conflicting_labels.html#
Also, my custom function takes a long time to run. What's the most efficient way to clean 2M records?
You can use drop_duplicates with a subset:
out = df.drop_duplicates(['A', 'B', 'C'], ignore_index=True)
print(out)
# Output
A B C Target
0 1 2 3 0
1 2 8 9 1
2 9 6 5 1
3 3 7 0 0
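Here is a minimal self-contained sketch of the approach (the small DataFrame below is hypothetical, standing in for your input table). By default drop_duplicates keeps the first occurrence of each feature combination, so one of the conflicting labels survives; if you would rather discard conflicting samples entirely, pass keep=False to drop every row whose features appear more than once. Both operations are vectorized in pandas, so they scale to millions of rows far better than a row-by-row custom function.

```python
import pandas as pd

# Hypothetical sample: rows 0 and 1 share the same features
# (A, B, C) but carry different targets -- a conflicting label.
df = pd.DataFrame({
    'A': [1, 1, 2, 9],
    'B': [2, 2, 8, 6],
    'C': [3, 3, 9, 5],
    'Target': [0, 1, 1, 1],
})

# Keep the first occurrence of each feature combination.
out = df.drop_duplicates(['A', 'B', 'C'], ignore_index=True)
print(out)

# Drop *all* copies of any duplicated feature combination,
# discarding the conflicting samples entirely.
clean = df.drop_duplicates(['A', 'B', 'C'], keep=False, ignore_index=True)
print(clean)
```

Which variant you want depends on whether an arbitrary surviving label is acceptable for your model or whether conflicting samples should be removed outright.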