I am cleaning a dataset using duckdb package in Python. The code is as follows:
dt = db.query(
"""
select *
from dt_orig
where (A != '') and
(B != '' or B != 'Unknown' or B != 'Undefined') and
(C != '') and
(D != '' or D != '[UNKNOWN]')
""").df()
However, when I check the unique levels of each variable (A, B, C, and D), some to-be-excluded-out rows are still there. The python code used to print the unique levels are as follows:
print("The unique A are", np.unique(dt['A']), ".\n")
print("The unique B are", np.unique(dt['B']), ".\n")
print("The unique C are", np.unique(dt['C']), ".\n")
print("The unique D are", np.unique(dt['D']), ".\n")
The output are as follows:
The unique A are ['A1' 'A2' 'A3' 'A4'] .
The unique B are ['' 'B1' 'Undefined' 'Unknown' 'B2'] .
The unique C are ['C1' 'C2' 'C3' 'C4'] .
The unique D are ['[UNKNOWN]' 'D1' 'D2' 'D3' 'D4'] .
Who can help? Thank you.
You have a logical error in your where condition:
(B != '' or B != 'Unknown' or B != 'Undefined')
Instead of this condition you should write:
(B is not null and B != 'Unknown' and B != 'Undefined')
or
(B is not null and B not in ('Unknown', 'Undefined'))