Tags: pandas, scikit-learn, regression, preprocessor

pandas: turning all strings into numbers (one-hot encoding) for regression


I am trying to preprocess data that looks like this:

train.head(4)

    Id  MSSubClass  MSZoning    LotFrontage LotArea Street  Alley   LotShape    LandContour Utilities   ... PoolArea    PoolQC  Fence   MiscFeature MiscVal MoSold  YrSold  SaleType    SaleCondition   SalePrice
0   1.0 60.0    RL  65.0    8450    Pave    NaN Reg Lvl AllPub  ... 0   NaN NaN NaN 0   2   2008    WD  Normal  208500
1   2.0 20.0    RL  80.0    9600    Pave    NaN Reg Lvl AllPub  ... 0   NaN NaN NaN 0   5   2007    WD  Normal  181500
2   3.0 60.0    RL  68.0    11250   Pave    NaN IR1 Lvl AllPub  ... 0   NaN NaN NaN 0   9   2008    WD  Normal  223500
3   4.0 70.0    RL  60.0    9550    Pave    NaN IR1 Lvl AllPub  ... 0   NaN NaN NaN 0   2   2006    WD  Abnorml 140000
4 rows × 81 columns

I have to find a way to turn these strings into numbers so that I can use them for regression. I am also aware that if I simply number the categories, I would introduce a misleading distance between them (ordinal rather than one-hot encoding), as illustrated below. Does someone know a smart way to do this?
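To illustrate the distance problem I mean (the category labels below are just placeholders, not my real data):

    import pandas as pd

    s = pd.Series(['RL', 'RM', 'FV', 'RL'])       # one string column, e.g. MSZoning
    codes = s.astype('category').cat.codes         # FV -> 0, RL -> 1, RM -> 2
    # A regression model would now treat RM as twice as far from FV as RL is,
    # even though these categories have no meaningful order or distance.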



Solution

  • You can try pandas.get_dummies() to encode categorical data. You can see the documentation here. It won't convert your numeric columns (i.e. it will leave them intact). See this example from the official documentation:

    import pandas as pd

    # Columns A and B are strings (categorical); C is already numeric and is left untouched.
    df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'],
                       'C': [1, 2, 3]})

    pd.get_dummies(df, prefix=['col1', 'col2'])
       C  col1_a  col1_b  col2_a  col2_b  col2_c
    0  1       1       0       0       1       0
    1  2       0       1       1       0       0
    2  3       1       0       0       0       1
    
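    Applied to the housing frame from the question, a single call one-hot encodes every string (object) column and leaves the numeric ones alone. A minimal sketch, assuming the data sits in a file named train.csv and using the column names shown above:

    import pandas as pd

    train = pd.read_csv('train.csv')                       # assumed file name
    train_encoded = pd.get_dummies(train)                  # one-hot encode all object/category columns
    # dummy_na=True would additionally add an indicator column for NaN values (e.g. Alley, Fence)
    X = train_encoded.drop(columns=['Id', 'SalePrice'])    # features for the regression
    y = train_encoded['SalePrice']                         # target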

    If you have many categorical features and many unique values per feature, you can try scikit-learn's DictVectorizer, which returns a sparse matrix and scales better in that case. See the documentation here.
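    A minimal DictVectorizer sketch under the same assumptions (a train.csv file and the column names from the question); it builds 'column=value' features from row dictionaries, passes numeric values through, and returns a sparse matrix:

    import pandas as pd
    from sklearn.feature_extraction import DictVectorizer

    train = pd.read_csv('train.csv')                       # assumed file name, as above
    records = train.drop(columns=['Id', 'SalePrice']).to_dict(orient='records')

    vec = DictVectorizer(sparse=True)                      # sparse output is the default
    X_sparse = vec.fit_transform(records)                  # e.g. a 'MSZoning=RL' feature per category
    y = train['SalePrice']
    # NaNs are passed through as numeric NaN here; fill or drop them before fitting a model.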

    You can check this link to see which encoding to use based on your algorithm.