Search code examples
pythonpandasdataframerounding

DataFrame Logic To Remove Rows Not Working


Background - I have a pandas DataFrame, which I have performed some math on, in order to calculate values to populate the Entity ID and % Ownership column with -

df['Entity ID %'] = df.groupby('Entity ID')['% Ownership'].transform(sum)
df['Account # %'] = df.groupby('Entity ID')['% Ownership'].transform(sum)

I then resave the DataFrame with only rows for those 2x columns whos value !=1 or !=0, as specified by this messy line of code. This forms my 'Exception Report' -

df = df[(df['Entity ID %'] != 1.0000000) & (df['Entity ID %'] != 0.0000000) & (df['Account # %'] != 1.0000000) & (df['Account # %'] != 0.0000000)]

I then request that both column's values are rounded to 7-decimal places -
df = df.round({'Entity ID %': 7, 'Account # %': 7})

Finally, the DataFrame is written to a .xlsx file -

with open(filename, 'w') as output_data:
        df.to_excel(filename+'.xlsx', sheet_name='Ownership Exception Report')

Issue - despite the code to resave the DataFrame with only rows that are != 1 or !=0, some rows that have 1 values in both columns still remain and are written to the xlsx, despite meeting the criteria to be removed.

Function 1 - this creates the Entity ID % and Account # % columns, as well as performing math, that will populate those columns -

def ownership_qc():
    df = unpack_response()
    df['Entity ID %'] = '-'
    df['Account # %'] = '-'
    df['Reason to Dismiss'] = '-'
    df['Entity ID %'] = df.groupby('Entity ID')['% Ownership'].transform(sum)
    df['Account # %'] = df.groupby('Entity ID')['% Ownership'].transform(sum)
    return df

Function 2 - this resaves the DataFrame with the criteria mentioned above (!=0 or !=1) before rounding to 7-decimal places, and finally writing to the .csv.

def ownership_exceptions():
    df = ownership_qc()
    df = df[(df['Entity ID %'] != 1.0000000) & (df['Entity ID %'] != 0.0000000) & (df['Account # %'] != 1.0000000) & (df['Account # %'] != 0.0000000)]
    df = df.round({'Entity ID %': 7, 'Account # %': 7})
    #   Counting rows in df
    index = df.index
    number_of_rows = len(index)
    timestr = datetime.datetime.now().strftime("%Y-%m-%d-%H-%M")
    filename = 'ownership_exceptions_'+timestr
    with open(filename, 'w') as output_data:
        df.to_excel(filename+'.xlsx', sheet_name='Ownership Exception Report')
    print("---------------------------\n","EXCEPTION REPORT:", number_of_rows, "rows", "\n---------------------------")
    return df

Is anyone able to help me identify why some rows that meet the criteria to be removed, still remain in my DataFrame and thus still being written to the .xlsx file?

I am hoping my logic is flawed and it's an easy fix, however I have reached a dead-end troubleshooting this.


Solution

  • After having a close look at your code, I can see that you are initializing

     df['Entity ID %'] = '-'
     df['Account # %'] = '-'
    

    which makes them datatype object. You can not compare object with integers.

    Make the following change in your ownership_qc():

     df['Entity ID %'] = 0
     df['Account # %'] = 0
    

    That will help!