I have a large, messy dataset with more than 1,000 columns and mixed dtypes: 2 int64 columns, 119 float64 columns and 1266 object columns.
I would like to begin data cleaning, but there are several issues. With this many columns, visually inspecting the data to locate errors is too tedious. A sample of the dataset is below:
Company ID | Year | Date         | Actual Loan Loss | Depreciation | Accounts Payable
001        | 2001 | 19 Oct 2001  | 100000.00        | 40000        | $$ER: 4540,NO DATA VALUES FOUND
002        | 2002 | 18 Sept 2001 | NaN              | $$ER: E100,NO WORLDSCOPE DATA FOR THIS CODE |
003        | 2004 | 01 Aug 2000  | 145000.00        | 5000         | Finance Dept
I would like to remove all the error values before dropping the null rows. The error values typically start with "$$ER:".
I've tried the following:
import pandas as pd

# load the dataset
df = pd.read_excel("path/file1.xlsx", sheet_name="DATA_TS")

# examine the data
df.head(20)

# check the number of rows, columns and dtypes
df.info()

# create a function to replace the known error values with empty strings
def convert_datatypes(val):
    new_val = val.replace('$$ER: 4540,NO DATA VALUES FOUND', '').replace('$$ER: E100,NO WORLDSCOPE DATA FOR THIS CODE', '')
    return new_val

# apply the function column by column and keep the result
df = df.apply(convert_datatypes)
The code worked, but when I checked again I realised that there were other error values, such as "$$ER: E100,INVALID CODE OR EXPRESSION ENTERED".
I am pretty sure there are other error values as well. Is there a more efficient way to remove all the error values and, at the same time, change the dtype of each column to the correct one (i.e., from object to either int or str)?
Appreciate any form of help, thank you in advance!
This will do the trick:
for col in df.columns[df.dtypes == 'object']:
    # blank out every cell that starts with the error prefix
    df.loc[df[col].str.startswith('$$ER', na=False), col] = ''
You can also use contains(), but you will have to specify regex=False because $ is a regex metacharacter:
for col in df.columns[df.dtypes == 'object']:
    # contains() looks for the error prefix anywhere in the string
    df.loc[df[col].str.contains('$$ER', na=False, regex=False), col] = ''
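If you also want to tackle the dtype part of the question, here is a minimal sketch along the same lines. It assumes you mark the error cells with NaN instead of '' (so dropna() catches them) and that the object columns you care about contain purely numeric strings once the errors are gone:

import numpy as np
import pandas as pd

# mark error cells as missing instead of blanking them out
for col in df.columns[df.dtypes == 'object']:
    df.loc[df[col].str.startswith('$$ER', na=False), col] = np.nan

# drop the null rows, as planned
df = df.dropna()

# convert an object column to numeric only if every remaining value parses,
# so genuine text columns (e.g. 'Accounts Payable') are left untouched
for col in df.columns[df.dtypes == 'object']:
    converted = pd.to_numeric(df[col], errors='coerce')
    if converted.notna().sum() == df[col].notna().sum():
        df[col] = converted

Using NaN rather than an empty string is the key choice here: it lets dropna() and pd.to_numeric() handle the dtype conversion without any further string handling.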