Search code examples
postgresqlpandasblank-line

Blank rows occuring despite use of NOT NULL and <> ''


I'm trying to remove all empty/blank cells from my table. However, I have some blank cells remaining even after I attempt to remove them with the aforementioned methods in the title.

I've tried NOT NULL and <> '', likewise, I've tried >0. None of these seem to remove the blank cells. I'm not sure what other type it could be. The columns are varchar so it's hard to identify what it is.

Seemingly, nobody has encountered this, as I haven't been able to find any similar articles or issues. This table is an incredible mess as there are glaring inconsistencies everywhere.

The statement I am using:

SELECT * FROM table WHERE column is NOT NULL AND column <> ''

Ideally, all of the blank cells would be gone so I could insure that my Pandas df is accurate.

My code in Python is locating about 2,000 "null" entries in the table:

def enumerate_null_data(df):
    #pandas doesn't support blank strings or None distinguishments with isnull/isna, so we replace those with np.NaN
    #a data type that is consistent with its archictecture/is handled properly
    df['rfid_sent'].replace(['', None], np.nan, inplace=True)
    df['rfid_received'].replace(['', None], np.nan, inplace=True)
    #dataframe that no longer contains the null values
    sent_null_removed = df.dropna(subset=['rfid_sent'])
    received_null_removed = df.dropna(subset=['rfid_received'])

    #create a dataframe that has all of the entries that were removed from sent_null_removed/received_null_removed
    #and count them (get the length of that dataframe)
    num_sent_null_removed = len(df[~df.index.isin(sent_null_removed.index)].index)
    num_received_null_removed = len(df[~df.index.isin(received_null_removed.index)].index)


    # dataframe containing only the values that were null/NA
    na_only = df[~df.index.isin(sent_null_removed.index) | ~df.index.isin(received_null_removed.index)]

    return (na_only, num_sent_null_removed, num_received_null_removed)

I honestly have no idea what else to try. Is there some "Empty" format I am missing here? Pandas recognized the blank cells as: '', Empty, None and np.nan. Yes, the whole variety. :S


Solution

  • The solution to this problem had to do with Pattern Recognition using Regex. Particularly, I attempted a large number of solutions that filtered out any form of erroneous data types that would dirty the data table. Unfortunately, the solution was not so simple.

    However, the data (that I am anticipating/want to remain) is highly consistent in structure, and so I just used RegEx to filter out any such data entries that did not align with my expectations.