Different outcome using pandas nunique() and unique()


I have a large DataFrame with 10 million rows and I need to find the number of unique values in each column.

I wrote the function below (it needs to return a Series):

def count_unique_values(df):
    return pd.Series(df.nunique())

and I get this output:

Area          210
Item          436
Element         4
Year           53
Unit            2
Value      313640
dtype: int64

The expected result for Value is 313641.

When I just run

df['Value'].unique()

I do get that many values. I can't figure out why nunique() returns one fewer for that column only.


Solution

  • DataFrame.nunique omits missing values because its default parameter is dropna=True, whereas Series.unique keeps them.

    Sample:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        'A': list('abcdef'),
        'D': [np.nan, 3, 5, 5, 3, 5],
    })
    
    print (df)
       A    D
    0  a  NaN
    1  b  3.0
    2  c  5.0
    3  d  5.0
    4  e  3.0
    5  f  5.0
    
    def count_unique_values(df):
        return df.nunique()
    
    print (count_unique_values(df))
    A    6
    D    2
    dtype: int64
    
    print (df['D'].unique())
    [nan  3.  5.]
    

    print (df['D'].nunique())
    2
    

    The solution is to add the parameter dropna=False:

    print (df['D'].nunique(dropna=False))
    3
    
    print (len(df['D'].unique()))
    3
    

    So in your function:

    def count_unique_values(df):
        return df.nunique(dropna=False)
    print (count_unique_values(df))
    A    6
    D    3
    dtype: int64
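
To see which columns are affected, one option is to check for missing values directly. The sketch below is only an illustration on the small sample frame from this answer, not the asker's data; the variable names has_nan, with_nan and without_nan are made up for the example. It shows that counting NaN as a value adds exactly one to each column that contains at least one NaN, which would explain a difference of one in a single column such as Value.

```python
import numpy as np
import pandas as pd

# Small sample frame (same shape as the one in the answer); 'D' has one NaN.
df = pd.DataFrame({
    'A': list('abcdef'),
    'D': [np.nan, 3, 5, 5, 3, 5],
})

# True for every column that holds at least one missing value.
has_nan = df.isna().any()

# Unique counts with NaN treated as a value, and with NaN dropped (the default).
with_nan = df.nunique(dropna=False)
without_nan = df.nunique()

print(has_nan)
# The difference is 1 for columns containing NaN and 0 otherwise.
print(with_nan - without_nan)
```

If the difference is 1 only for the Value column, that column contains missing values and dropna=False is likely what is needed.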