Search code examples
sqldata-cleaning

SQL: exclude out certain rows in a messy dataset


I am cleaning a dataset using duckdb package in Python. The code is as follows:

dt = db.query(
    """
    select * 
    from dt_orig 
    where (A != '') and 
          (B != '' or B != 'Unknown' or B != 'Undefined') and
          (C != '') and
          (D != '' or D != '[UNKNOWN]')
    """).df()

However, when I check the unique levels of each variable (A, B, C, and D), some to-be-excluded-out rows are still there. The python code used to print the unique levels are as follows:

print("The unique A are", np.unique(dt['A']), ".\n")
print("The unique B are", np.unique(dt['B']), ".\n")
print("The unique C are", np.unique(dt['C']), ".\n")
print("The unique D are", np.unique(dt['D']), ".\n")

The output are as follows:

The unique A are ['A1' 'A2' 'A3' 'A4'] .

The unique B are ['' 'B1' 'Undefined' 'Unknown' 'B2'] .

The unique C are ['C1' 'C2' 'C3' 'C4'] .

The unique D are ['[UNKNOWN]' 'D1' 'D2' 'D3' 'D4'] .

Who can help? Thank you.


Solution

  • You have a logical error in your where condition:

    (B != '' or B != 'Unknown' or B != 'Undefined')
    

    Instead of this condition you should write:

    (B is not null and B != 'Unknown' and B != 'Undefined')
    

    or

    (B is not null and B not in ('Unknown', 'Undefined'))