python · pandas · dataframe · data-cleaning

How to remove error values in large df with 1000 columns


I have a large dataset with more than 1000 columns. The data is messy, with mixed dtypes: 2 int64 columns, 119 float columns and 1266 object columns.

I would like to begin data cleaning but have realised there are several issues. As there are too many columns, visually inspecting the data to locate errors is too tedious. A sample of the dataset is below:

Company ID  Year  Date          Actual Loan Loss  Depreciation                                 Accounts Payable
001         2001  19 Oct 2001   100000.00         40000                                        $$ER: 4540,NO DATA VALUES FOUND
002         2002  18 Sept 2001  NaN               $$ER: E100,NO WORLDSCOPE DATA FOR THIS CODE
003         2004  01 Aug 2000   145000.00         5000                                         Finance Dept

I would like to remove all the error values before dropping the null rows. The error values typically start with "$$ER:".

I've tried the following

# load the dataset
import pandas as pd

df = pd.read_excel("path/file1.xlsx", sheet_name="DATA_TS")

# examine the first rows
df.head(20)

# check number of rows, cols and dtypes
df.info()

# create a function to replace the known error values

def convert_datatypes(col):
    # df.apply passes each column in as a Series; Series.replace
    # swaps exact matches for an empty string
    new_col = col.replace('$$ER: 4540,NO DATA VALUES FOUND', '')
    new_col = new_col.replace('$$ER: E100,NO WORLDSCOPE DATA FOR THIS CODE', '')
    return new_col

# assign the result back, otherwise the cleaned frame is discarded
df = df.apply(convert_datatypes)

The code worked but I checked again and realised that there were other error values such as "$$ER: E100,INVALID CODE OR EXPRESSION ENTERED".

I am pretty sure there are other error values as well. Is there a more efficient way to remove all the error values and, at the same time, convert the columns to their correct dtypes (i.e., from object to either int or str)?

Appreciate any form of help, thank you in advance!


Solution

  • This will do the trick:

    for col in df.columns[df.dtypes == 'object']:
        df.loc[df[col].str.startswith('$$ER', na=False), col] = ''
    
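    For instance, applied to a toy frame mirroring the question's sample (column names and values taken from the sample above, so they are illustrative only):

    ```python
    import numpy as np
    import pandas as pd

    # Toy frame mirroring the question's sample data
    df = pd.DataFrame({
        'Depreciation': ['40000', '$$ER: E100,NO WORLDSCOPE DATA FOR THIS CODE', '5000'],
        'Accounts Payable': ['$$ER: 4540,NO DATA VALUES FOUND', np.nan, 'Finance Dept'],
    })

    # Blank every cell that starts with the error prefix;
    # na=False makes NaN cells evaluate to False instead of raising
    for col in df.columns[df.dtypes == 'object']:
        df.loc[df[col].str.startswith('$$ER', na=False), col] = ''
    ```

    After the loop, every "$$ER:" cell is an empty string while ordinary values such as "Finance Dept" are untouched.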

    You can also use contains(), but you will have to pass regex=False because $ is a regex metacharacter:

    for col in df.columns[df.dtypes == 'object']:
        df.loc[df[col].str.contains('$$ER', na=False, regex=False), col] = ''
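
    To also cover the dtype part of the question, here is a sketch (the function name and the choice of NaN over an empty string are mine, not from the question): blanking the errors to NaN keeps dropna() usable afterwards, and pd.to_numeric with errors='coerce' lets you convert only the columns that turn out to be fully numeric:

    ```python
    import numpy as np
    import pandas as pd

    def clean_errors(df):
        out = df.copy()
        for col in out.columns[out.dtypes == 'object']:
            # Replace "$$ER:"-prefixed cells with NaN so dropna() works later
            out.loc[out[col].str.startswith('$$ER', na=False), col] = np.nan
            # Try numeric conversion; only keep it if no real value was
            # lost, i.e. the column held nothing but numbers (and NaN)
            converted = pd.to_numeric(out[col], errors='coerce')
            if converted.notna().sum() == out[col].notna().sum():
                out[col] = converted
        return out

    sample = pd.DataFrame({
        'Depreciation': ['40000', '$$ER: E100,NO WORLDSCOPE DATA FOR THIS CODE', '5000'],
        'Accounts Payable': ['$$ER: 4540,NO DATA VALUES FOUND', np.nan, 'Finance Dept'],
    })
    cleaned = clean_errors(sample)
    ```

    On this sample, Depreciation becomes a numeric column while Accounts Payable stays object because "Finance Dept" cannot be coerced.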