I am trying to preprocess data that looks like this:
train.head(4)
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 1.0 60.0 RL 65.0 8450 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 2 2008 WD Normal 208500
1 2.0 20.0 RL 80.0 9600 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 5 2007 WD Normal 181500
2 3.0 60.0 RL 68.0 11250 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 4.0 70.0 RL 60.0 9550 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
4 rows × 81 columns
I have to find a way to turn these strings into numbers so that I can use them for regression. I am also aware that if I simply number them, I might introduce a misleading distance relationship between the categories (since they are not one-hot encoded). Does someone know a smart way to do this?
You can try pandas.get_dummies() to encode the categorical data. You can see the documentation here. It won't convert your integer columns, i.e. it will leave them intact (see this example from the official documentation).
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'],
                   'C': [1, 2, 3]})
pd.get_dummies(df, prefix=['col1', 'col2'])
   C  col1_a  col1_b  col2_a  col2_b  col2_c
0  1       1       0       0       1       0
1  2       0       1       1       0       0
2  3       1       0       0       0       1
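Applied to your data it would look roughly like the sketch below (assuming train is the DataFrame from your head(4) output; the explicit column list is only an illustration taken from your sample columns):

import pandas as pd

# One-hot encode every object (string) column; numeric columns are left intact.
train_encoded = pd.get_dummies(train)

# Or restrict the encoding to specific columns if you want more control
# (MSZoning, Street, LotShape are picked from your sample output):
train_encoded = pd.get_dummies(train, columns=['MSZoning', 'Street', 'LotShape'])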
If the number of categorical features is large and the number of unique values per feature is large as well, you can try scikit-learn's DictVectorizer. See the documentation here.
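A rough sketch of that route, again assuming train is your DataFrame (DictVectorizer expects a list of dicts, so the rows are converted first; NaN in the string columns is filled with a placeholder so it gets its own dummy column):

from sklearn.feature_extraction import DictVectorizer

# DictVectorizer one-hot encodes string values and passes numeric values through.
cat_cols = train.select_dtypes(include='object').columns
train[cat_cols] = train[cat_cols].fillna('missing')

vec = DictVectorizer(sparse=True)
X = vec.fit_transform(train.to_dict(orient='records'))  # sparse feature matrix for regression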
You can check this link to see which encoding to use based on your algorithm.