I have identical samples with different labels, which occurred due to mislabeled data. Mislabeled data can confuse the model and lower its performance.
It's a binary classification problem. My input table is something like below.
I want the following table as my cleaned data.
I tried this data-cleaning library to check for conflicting labels, but it was not able to clean the data: https://docs.deepchecks.com/stable/checks_gallery/tabular/data_integrity/plot_conflicting_labels.html#
Also, my custom function takes a long time to run. What's the most efficient way to clean 2M records?
You can use drop_duplicates with a subset:
out = df.drop_duplicates(['A', 'B', 'C'], ignore_index=True)
print(out)
# Output
A B C Target
0 1 2 3 0
1 2 8 9 1
2 9 6 5 1
3 3 7 0 0
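Here is a minimal self-contained sketch of the approach (the small DataFrame below is hypothetical, standing in for your input table). By default drop_duplicates keeps the first occurrence of each feature combination, so one of the conflicting labels survives; if you would rather discard conflicting samples entirely, pass keep=False to drop every row whose features appear more than once. Both operations are vectorized in pandas, so they scale to millions of rows far better than a row-by-row custom function.

```python
import pandas as pd

# Hypothetical sample: rows 0 and 1 share the same features
# (A, B, C) but carry different targets -- a conflicting label.
df = pd.DataFrame({
    'A': [1, 1, 2, 9],
    'B': [2, 2, 8, 6],
    'C': [3, 3, 9, 5],
    'Target': [0, 1, 1, 1],
})

# Keep the first occurrence of each feature combination.
out = df.drop_duplicates(['A', 'B', 'C'], ignore_index=True)
print(out)

# Drop *all* copies of any duplicated feature combination,
# discarding the conflicting samples entirely.
clean = df.drop_duplicates(['A', 'B', 'C'], keep=False, ignore_index=True)
print(clean)
```

Which variant you want depends on whether an arbitrary surviving label is acceptable for your model or whether conflicting samples should be removed outright.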