Tags: python-3.x, pandas, dataframe, preprocessor, sklearn-pandas

How to detect suspicious error in a column of a dataset?


I was trying to encode the data in train.csv, the dataset provided in this GitHub repository. I used the following code to do so.

import pandas as pd 
from sklearn import preprocessing
df = pd.read_csv(r'train.csv',index_col='Id')
df.head()
df['MSSubClass'].fillna(df['MSSubClass'].mean()//1)
df['MSZoning'].fillna(df['MSZoning'].mode())
label_encoder = preprocessing.LabelEncoder() 
for col in df.columns:
    if df[col].dtype == 'O':
        print(df[col])
        df[col] = label_encoder.fit_transform(df[col])
print(df) 

While encoding, I got the following output.

MSSubClass
MSZoning
LotFrontage
LotArea
Street
Alley
TypeError: '<' not supported between instances of 'str' and 'float'

But when I looked at the dataset, there wasn't any '<' in the Alley column. The previous columns were encoded, but the Alley column causes an error. Please help me!

This is the Colab notebook with the code.


Solution

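  • The problem is that your missing values are not actually replaced: fillna returns a new object, so you need to assign the result back, and you need to do it for all columns, not just two. The NaN values that remain are floats, and they cannot be compared with the strings in columns such as Alley, which is what the TypeError is complaining about. I also added .iloc[0] to mode() to select the first value if there are 2 or more modes: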

    import numpy as np
    import pandas as pd
    from sklearn import preprocessing
    
    df = pd.read_csv(r'train.csv', index_col='Id')
    print(df)
    
    # Split the columns into numeric and non-numeric (object) groups
    colsNum = df.select_dtypes(np.number).columns
    colsObj = df.columns.difference(colsNum)
    
    # Fill missing values and assign the result back to the DataFrame
    df[colsNum] = df[colsNum].fillna(df[colsNum].mean()//1)
    df[colsObj] = df[colsObj].fillna(df[colsObj].mode().iloc[0])
    
    label_encoder = preprocessing.LabelEncoder()
    for col in colsObj:
        print(df[col])
        df[col] = label_encoder.fit_transform(df[col])
    

    print (df)
          MSSubClass  MSZoning  LotFrontage  LotArea  Street  Alley  LotShape  \
    Id                                                                          
    1             60         3         65.0     8450       1      0         3   
    2             20         3         80.0     9600       1      0         3   
    3             60         3         68.0    11250       1      0         0   
    4             70         3         60.0     9550       1      0         0   
    5             60         3         84.0    14260       1      0         0   
             ...       ...          ...      ...     ...    ...       ...   
    1456          60         3         62.0     7917       1      0         3   
    1457          20         3         85.0    13175       1      0         3   
    1458          70         3         66.0     9042       1      0         3   
    1459          20         3         68.0     9717       1      0         3   
    1460          20         3         75.0     9937       1      0         3   
    
          LandContour  Utilities  LotConfig  ...  PoolArea  PoolQC  Fence  \
    Id                                       ...                            
    1               3          0          4  ...         0       2      2   
    2               3          0          2  ...         0       2      2   
    3               3          0          4  ...         0       2      2   
    4               3          0          0  ...         0       2      2   
    5               3          0          2  ...         0       2      2   
              ...        ...        ...  ...       ...     ...    ...   
    1456            3          0          4  ...         0       2      2   
    1457            3          0          4  ...         0       2      2   
    1458            3          0          4  ...         0       2      0   
    1459            3          0          4  ...         0       2      2   
    1460            3          0          4  ...         0       2      2   
    
          MiscFeature  MiscVal  MoSold  YrSold  SaleType  SaleCondition  SalePrice  
    Id                                                                              
    1               2        0       2    2008         8              4     208500  
    2               2        0       5    2007         8              4     181500  
    3               2        0       9    2008         8              4     223500  
    4               2        0       2    2006         8              0     140000  
    5               2        0      12    2008         8              4     250000  
              ...      ...     ...     ...       ...            ...        ...  
    1456            2        0       8    2007         8              4     175000  
    1457            2        0       2    2010         8              4     210000  
    1458            2     2500       5    2010         8              4     266500  
    1459            2        0       4    2010         8              4     142125  
    1460            2        0       6    2008         8              4     147500  
    
    [1460 rows x 80 columns]
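
To see which columns are affected before encoding, you can count the missing values per column. Below is a minimal sketch on a made-up toy frame (the `toy` DataFrame and its values are hypothetical stand-ins for train.csv, not the real data). It also shows that sorting a mix of strings and NaN raises the same kind of TypeError, and that fillna has no effect unless the result is assigned back. How LabelEncoder itself treats NaN may differ between scikit-learn versions, so the sketch reproduces the error with plain `sorted` instead.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data standing in for a few columns of train.csv
toy = pd.DataFrame({
    "Alley": ["Grvl", np.nan, "Pave", np.nan],
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Count missing values per column; a non-zero count marks a column to fill
missing = toy.isna().sum()
cols_with_nan = missing[missing > 0].index.tolist()
print(cols_with_nan)

# NaN is a float, so sorting it together with strings fails
try:
    sorted(toy["Alley"].tolist())
    raised = False
except TypeError as exc:
    raised = True
    print(exc)

# fillna returns a new Series; without assignment the column is unchanged
toy["Alley"].fillna("Grvl")
still_missing = int(toy["Alley"].isna().sum())

# Assign back, using the first mode when several values tie
toy["Alley"] = toy["Alley"].fillna(toy["Alley"].mode().iloc[0])
encoded = LabelEncoder().fit_transform(toy["Alley"])
print(encoded)
```

On the real dataset, the same `df.isna().sum()` check would show Alley among the columns that still contain NaN after your original two fillna calls.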