python pandas machine-learning scikit-learn one-hot-encoding

"ValueError: A given column is not a column of the dataframe" when trying to convert categorical feature into numerical

I am using a CSV file from a Udemy course for practice. To keep things simple, I only want to use the Age and Country columns. Here is the code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.compose import ColumnTransformer as ct
from sklearn.model_selection import train_test_split as tts

data = pd.read_csv("advertising.csv")

X = data[["Age","Country"]]
y = data[["Clicked on Ad"]]


from sklearn.preprocessing import OneHotEncoder
cat = X["Country"]
one_hot = OneHotEncoder()
transformer = ct([("one_hot", one_hot, cat)],remainder="passthrough")
transformed_X = transformer.fit_transform(X)

print(transformed_X)

I get this error:

runfile('C:/Users/--/.spyder-py3/untitled0.py', wdir='C:/Users/--/.spyder-py3')
Traceback (most recent call last):

  File "C:\Anaconda\lib\site-packages\pandas\core\indexes\base.py", line 2895, in get_loc
    return self._engine.get_loc(casted_key)

  File "pandas\_libs\index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc

  File "pandas\_libs\index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc

  File "pandas\_libs\hashtable_class_helper.pxi", line 1675, in pandas._libs.hashtable.PyObjectHashTable.get_item

  File "pandas\_libs\hashtable_class_helper.pxi", line 1683, in pandas._libs.hashtable.PyObjectHashTable.get_item

KeyError: 'Tunisia'


The above exception was the direct cause of the following exception:

Traceback (most recent call last):

  File "C:\Anaconda\lib\site-packages\sklearn\utils\__init__.py", line 447, in _get_column_indices
    col_idx = all_columns.get_loc(col)

  File "C:\Anaconda\lib\site-packages\pandas\core\indexes\base.py", line 2897, in get_loc
    raise KeyError(key) from err

KeyError: 'Tunisia'


The above exception was the direct cause of the following exception:

Traceback (most recent call last):

  File "C:\Users\--\.spyder-py3\untitled0.py", line 17, in <module>
    transformed_X = transformer.fit_transform(X)

  File "C:\Anaconda\lib\site-packages\sklearn\compose\_column_transformer.py", line 529, in fit_transform
    self._validate_remainder(X)

  File "C:\Anaconda\lib\site-packages\sklearn\compose\_column_transformer.py", line 327, in _validate_remainder
    cols.extend(_get_column_indices(X, columns))

  File "C:\Anaconda\lib\site-packages\sklearn\utils\__init__.py", line 454, in _get_column_indices
    raise ValueError(

ValueError: A given column is not a column of the dataframe

"Tunisia" is the first value in the "Country" column.

What might have caused the problem?

Thank you in advance.

Solution

The problem occurs because you are not specifying the column to transform correctly. In this line:

transformer = ct([("one_hot", one_hot, cat)],remainder="passthrough")

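cat should be the name or the index of the column you want to transform. However, because you set cat = X["Country"], you are passing the column's values (a pandas Series) instead. ColumnTransformer then treats each value as a column name, starting with "Tunisia", and fails because X has no column with that name.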
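To see why "Tunisia" shows up in the traceback, here is a minimal sketch. It uses a tiny made-up DataFrame in place of advertising.csv (the rows are assumptions, only the column names come from the question):

```python
import pandas as pd

# Hypothetical sample rows standing in for advertising.csv
X = pd.DataFrame({"Age": [35, 31], "Country": ["Tunisia", "Nauru"]})

cat = X["Country"]             # selects the column's values, not its name
print(type(cat).__name__)      # Series
print(list(cat))               # ['Tunisia', 'Nauru']
print("Tunisia" in X.columns)  # False: "Tunisia" is a value, not a column
```

So when the transformer looks up each entry of cat among the columns of X, the very first lookup fails.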

To fix this issue, just use one of the following:

# option 1
cat = ['Country']

# option 2
cat = [1]

With either option, the Country column is one-hot encoded and Age is passed through unchanged.
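For reference, a self-contained sketch of both options. The sample rows are made up (I don't have the original CSV), and the variable names by_name and by_index are only for illustration:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample rows standing in for advertising.csv
X = pd.DataFrame({
    "Age": [35, 31, 26, 29],
    "Country": ["Tunisia", "Nauru", "San Marino", "Tunisia"],
})

# Option 1: refer to the column by name
by_name = ColumnTransformer(
    [("one_hot", OneHotEncoder(), ["Country"])],
    remainder="passthrough",
)

# Option 2: refer to the column by position (Country is column 1 of X)
by_index = ColumnTransformer(
    [("one_hot", OneHotEncoder(), [1])],
    remainder="passthrough",
)

X_by_name = by_name.fit_transform(X)
X_by_index = by_index.fit_transform(X)

# 3 distinct countries -> 3 one-hot columns, plus the passthrough Age column
print(X_by_name.shape)   # (4, 4)
print(X_by_index.shape)  # (4, 4)
```

Referring to the column by name is usually the safer choice, since it keeps working if the column order of X changes.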