python pandas keras scikit-learn one-hot-encoding

What is the difference between pd.get_dummies and sklearn's OneHotEncoder in Python?


As far as I know, pd.get_dummies and sklearn's OneHotEncoder do the same job. Can anyone explain the main difference between them, and which one is currently more efficient?


Solution

    1. Output difference

    pd.get_dummies returns a pandas DataFrame, whereas OneHotEncoder returns a SciPy CSR sparse matrix by default.

    Example -

    import pandas as pd
    
    s = pd.Series([1, 2, 3, 4, 5])
    s
    0    1
    1    2
    2    3
    3    4
    4    5
    dtype: int64
    
    type(pd.get_dummies(s))
    pandas.core.frame.DataFrame
    
    from sklearn.preprocessing import OneHotEncoder
    enc = OneHotEncoder()
    # The sparse result can be converted to a NumPy ndarray with .toarray()
    type(enc.fit_transform(s.values.reshape(-1, 1)))
    scipy.sparse.csr.csr_matrix
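
To make the output difference concrete, here is a minimal self-contained sketch that runs both encoders on the same Series and checks that the encoded values agree once the sparse matrix is densified. The variable names (`dummies`, `encoded`, `same`) are just illustrative, and the exact dtype of the `get_dummies` columns depends on the pandas version, so the comparison casts to float first.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

s = pd.Series([1, 2, 3, 4, 5])

# pandas: a dense DataFrame whose column labels are the category values
dummies = pd.get_dummies(s)

# scikit-learn: a sparse matrix by default; the input must be 2-D
enc = OneHotEncoder()
encoded = enc.fit_transform(s.to_numpy().reshape(-1, 1))

print(type(dummies).__name__)    # DataFrame
print(sparse.issparse(encoded))  # True
print(list(dummies.columns))     # category labels kept by pandas
print(enc.categories_)           # category labels kept by the encoder

# Once densified, both results hold the same 0/1 pattern
same = np.array_equal(dummies.to_numpy().astype(float), encoded.toarray())
print(same)
```

The practical consequence is that the DataFrame carries its column labels with it, while the encoder keeps the labels on the fitted object (`enc.categories_`) and returns only the numbers.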
    

    2. Execution time

    In the timings below, on a 250,000-element Series, pd.get_dummies is roughly two to three times faster than OneHotEncoder.

    Example -
    s = pd.Series([1, 2, 3, 4, 5]*50000)
    len(s)
    250000
    
    %timeit pd.get_dummies(s)
    15.2 ms ± 227 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    %timeit enc.fit_transform(s.values.reshape(-1, 1))
    34.1 ms ± 5.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
    %timeit enc.fit_transform(s.values.reshape(-1, 1)).toarray()  # dense ndarray output
    45.3 ms ± 5.63 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
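
The `%timeit` lines above only work inside IPython or Jupyter. As a rough sketch, the same comparison can be run as a plain script with the standard-library `timeit` module. The `candidates` and `results` names are illustrative, and the absolute numbers will differ by machine and by pandas and scikit-learn version, so treat the timings quoted above as indicative only.

```python
import timeit

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

s = pd.Series([1, 2, 3, 4, 5] * 50000)
X = s.to_numpy().reshape(-1, 1)  # OneHotEncoder expects 2-D input
enc = OneHotEncoder()

# The three variants timed in the original example
candidates = {
    "pd.get_dummies": lambda: pd.get_dummies(s),
    "OneHotEncoder (sparse)": lambda: enc.fit_transform(X),
    "OneHotEncoder (dense)": lambda: enc.fit_transform(X).toarray(),
}

results = {}
for name, fn in candidates.items():
    # Best of 3 repeats, 5 calls each, reported as seconds per call
    best = min(timeit.repeat(fn, number=5, repeat=3)) / 5
    results[name] = best
    print(f"{name}: {best * 1e3:.1f} ms per call")
```

Taking the minimum over repeats is a common way to reduce noise from other processes; it measures wall-clock time for this one input, not asymptotic complexity.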
    

    3. Input data dependency

    As explained in the old post