What is the difference between pd.get_dummies and sklearn's OneHotEncoder in Python? As far as I know, both do the same job. Can anyone explain the main difference between pd.get_dummies and sklearn's OneHotEncoder, and which one is currently more efficient?
1. Output difference
pd.get_dummies returns a pandas DataFrame, whereas OneHotEncoder returns a SciPy CSR sparse matrix by default.
Example:
import pandas as pd
s = pd.Series([1, 2, 3, 4, 5])
s
0 1
1 2
2 3
3 4
4 5
dtype: int64
type(pd.get_dummies(s))
pandas.core.frame.DataFrame
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
type(enc.fit_transform(s.values.reshape(-1, 1)))  # sparse result; call .toarray() on it to get a NumPy ndarray
scipy.sparse.csr.csr_matrix
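For anyone who wants to run the comparison outside an interactive session, here is a minimal sketch of the same check as a plain script. The variable names (dummies, encoded, dense) are my own, and it assumes pandas, SciPy and scikit-learn are installed with OneHotEncoder left at its default (sparse) output.

```python
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

s = pd.Series([1, 2, 3, 4, 5])

# pandas: returns a DataFrame with one indicator column per distinct value
dummies = pd.get_dummies(s)

# scikit-learn: expects 2-D input and returns a SciPy sparse matrix by default
enc = OneHotEncoder()
encoded = enc.fit_transform(s.to_numpy().reshape(-1, 1))

# Densify only when a NumPy array is really needed
dense = encoded.toarray()

print(type(dummies).__name__)    # DataFrame
print(sparse.issparse(encoded))  # True
print(dense.shape)               # (5, 5)
```

Both results encode the same indicator values here; only the container type differs.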
2. Speed
On this benchmark, pd.get_dummies is roughly twice as fast as OneHotEncoder (15.2 ms vs. 34.1 ms).
Example:
s = pd.Series([1, 2, 3, 4, 5]*50000)
len(s)
250000
%timeit pd.get_dummies(s)
15.2 ms ± 227 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit enc.fit_transform(s.values.reshape(-1, 1))
34.1 ms ± 5.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit enc.fit_transform(s.values.reshape(-1, 1)).toarray() # more reusable
45.3 ms ± 5.63 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
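%timeit only works in IPython/Jupyter. A rough equivalent with the standard-library timeit module might look like the sketch below. The helper best_ms and the repeat/number settings are my own choices, not part of either library, and the absolute numbers will vary with your machine and library versions, so I have not listed expected timings.

```python
import timeit

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

s = pd.Series([1, 2, 3, 4, 5] * 50000)
X = s.to_numpy().reshape(-1, 1)  # OneHotEncoder needs 2-D input
enc = OneHotEncoder()


def best_ms(func, repeat=3, number=5):
    """Best average time per call, in milliseconds (hypothetical helper)."""
    return min(timeit.repeat(func, repeat=repeat, number=number)) / number * 1000


timings = {
    "pd.get_dummies": best_ms(lambda: pd.get_dummies(s)),
    "OneHotEncoder (sparse)": best_ms(lambda: enc.fit_transform(X)),
    "OneHotEncoder + toarray": best_ms(lambda: enc.fit_transform(X).toarray()),
}

for name, ms in timings.items():
    print(f"{name}: {ms:.1f} ms")
```

Taking the minimum over several repeats reduces noise from other processes, which matters when the calls only take a few tens of milliseconds.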
3. Input data dependency
As explained in the old post