python machine-learning encoding nlp categorical-data

Convert categorical data into numerical data in Python

I have a data set. One of its columns - "Keyword" - contains categorical data. The machine learning algorithm that I am trying to use takes only numeric data. I want to convert "Keyword" column into numeric values - How can I do that? Using NLP? Bag of words?

I tried the following but I got ValueError: Expected 2D array, got 1D array instead.

from sklearn.feature_extraction.text import CountVectorizer
count_vector = CountVectorizer()
dataset['Keyword'] = count_vector.fit_transform(dataset['Keyword'])
from sklearn.model_selection import train_test_split
y=dataset['C']
x=dataset(['Keyword','A','B'])
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=0)
from sklearn.linear_model import LinearRegression
regressor=LinearRegression()
regressor.fit(x_train,y_train)

Solution

You probably want to use an Encoder. One of the most used and popular ones are LabelEncoder and OneHotEncoder. Both are provided as parts of sklearn library.

LabelEncoder can be used to transform categorical data into integers:

from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
x = ['Apple', 'Orange', 'Apple', 'Pear']
y = label_encoder.fit_transform(x)
print(y)

array([0, 1, 0, 2])

This would transform a list of ['Apple', 'Orange', 'Apple', 'Pear'] into [0, 1, 0, 2] with each integer corresponding to an item. This is not always ideal for ML as the integers have different numerical values, suggesting that one is bigger than the other, with, for example Pear > Apple, which is not at all the case. To not introduce this kind of problem you'd want to use OneHotEncoder.

OneHotEncoder can be used to transform categorical data into one hot encoded array. Encoding previously defined y by using OneHotEncoder would result in:

from numpy import array
from numpy import argmax
from sklearn.preprocessing import OneHotEncoder
onehot_encoder = OneHotEncoder(sparse=False)
y = y.reshape(len(y), 1)
onehot_encoded = onehot_encoder.fit_transform(y)
print(onehot_encoded)

[[1. 0. 0.]
[0. 1. 0.]
[1. 0. 0.]
[0. 0. 1.]]

Where each element of x turns into an array of zeroes and just one 1 which encodes the category of the element.

A simple tutorial on how to use this on a DataFrame can be found here.