Search code examples
pythonpandasmissing-data

How do I replace missing values with NaN


I am using the IMDB dataset for machine learning, and it contains a lot of missing values which are entered as '\N'. Specifically in the StartYear column which contains the movie year release I want to convert the values to integers. Which im not able to do right now, I could drop these values but I wanted to see why they're missing first. I tried several things but no success.

This is my latest attempt:

My attempt


Solution

  • Here is a way to do it without using replace:

    import pandas as pd
    import numpy as np
    df_basics = pd.DataFrame({'startYear':['\\N']*78760+[2017]*18267 + [2018]*18263+[2016]*17837+[2019]*17769+['1996 ','1993 ','2000 ','2019 ','2029 ']})
    print(pd.value_counts(df_basics.startYear))
    df_basics.loc[df_basics.startYear == '\\N','startYear'] = np.NaN
    print(pd.value_counts(df_basics.startYear, dropna=False))
    

    Output:

    NaN      78760
    2017     18267
    2018     18263
    2016     17837
    2019     17769
    1996         1
    1993         1
    2000         1
    2019         1
    2029         1