Search code examples
pythonpandasdataframemissing-datafillna

How to fill NaN values according to the data type in pandas data frame?


Greeting everyone. I have an excel file that I need to clean and fill NaN values according to column data types, like if column data type is object I need to fill "NULL" in that column and if data types is integer or float 0 needs to be filled in those columns.

So far I have tried 2 method to do the job but no luck, here is the first

df = pd.read_excel("myExcel_files.xlsx")

using bulit method for selecting columns by data types

df.select_dtypes(include='int64').fillna(0, inplace=True)

df.select_dtypes(include='float64').fillna(0.0, inplace=True)

df.select_dtypes(include='object').fillna("NULL", inplace=True)

and the output that I get is not an error but a warning and there is no change in data frame

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py:4259: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  **kwargs

as the first one was slice error so I thought doing it one column at a time, here is the code

df = pd.read_excel("myExcel_files.xlsx")

#get the list of all integer columns
int_cols = list(df.select_dtypes('int64').columns)

#get the list of all float columns
float_cols = list(df.select_dtypes('float64').columns)

#get the list of all object columns
object_cols = list(df.select_dtypes('object').columns)

#looping through if each column to fillna
for i in int_cols:
    df[i].fillna(0,inplace=True)

for f in float_cols:
    df[f].fillna(0,inplace=True)

for o in object_cols:
    df[o].fillna("NULL",inplace=True)

Both of my methods doesn't work. Many thanks for any help or suggestions. Regards -Manish


Solution

  • I think that instead of using select_dtypes and iterating over columns you can take the .dtypes of your DF and replace float64's wth 0.0 and objects with "NULL"... you don't need to worry about int64's as they generally won't have missing values to fill (unless you're using pd.NA or a nullable int type), so you might be able to do a single operation of:

    df.fillna(df.dtypes.replace({'float64': 0.0, 'O': 'NULL'}), inplace=True)
    

    You can also add downcast='infer' so that if you have what can be int64s in a float64 column, you end up with int64s, eg given:

    df = pd.DataFrame({
        'a': [1.0, 2, np.nan, 4],
        'b': [np.nan, 'hello', np.nan, 'blah'],
        'c': [1.1, 1.2, 1.3, np.nan]
    })
    

    Then:

    df.fillna(df.dtypes.replace({'float64': 0.0, 'O': 'NULL'}), downcast='infer', inplace=True)
    

    Will give you (note column a was downcast to int but c remains float):

       a      b    c
    0  1   NULL  1.1
    1  2  hello  1.2
    2  0   NULL  1.3
    3  4   blah  0.0