I keep receiving the following ValueError when I run my model:
ValueError: Specifying the columns using strings is only supported for pandas DataFrames
Here's the full version of the error:
Traceback (most recent call last):
File "/usr/lib/python3.8/site-packages/sklearn/utils/__init__.py", line 425, in _get_column_indices
all_columns = X.columns
AttributeError: 'numpy.ndarray' object has no attribute 'columns'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/user/Python Practice/Working/Playstore/untitled0.py", line 48, in <module>
run.fit(x,y)
File "/usr/lib/python3.8/site-packages/sklearn/pipeline.py", line 330, in fit
Xt = self._fit(X, y, **fit_params_steps)
File "/usr/lib/python3.8/site-packages/sklearn/pipeline.py", line 292, in _fit
X, fitted_transformer = fit_transform_one_cached(
File "/usr/lib/python3.8/site-packages/joblib/memory.py", line 352, in __call__
return self.func(*args, **kwargs)
File "/usr/lib/python3.8/site-packages/sklearn/pipeline.py", line 740, in _fit_transform_one
res = transformer.fit_transform(X, y, **fit_params)
File "/usr/lib/python3.8/site-packages/sklearn/compose/_column_transformer.py", line 529, in fit_transform
self._validate_remainder(X)
File "/usr/lib/python3.8/site-packages/sklearn/compose/_column_transformer.py", line 327, in _validate_remainder
cols.extend(_get_column_indices(X, columns))
File "/usr/lib/python3.8/site-packages/sklearn/utils/__init__.py", line 427, in _get_column_indices
raise ValueError("Specifying the columns using strings is only "
ValueError: Specifying the columns using strings is only supported for pandas DataFrames
Here's my code:
import pandas as pd
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from category_encoders import CatBoostEncoder
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
data = pd.read_csv("data.csv",index_col=("Unnamed: 0"))
y = data.Installs
x = data.drop("Installs",axis=1)
strat = ["mean","median","most_frequent","constant"]
num_imp = SimpleImputer(strategy=strat[0])
obj_imp = SimpleImputer(strategy=strat[2])
# Set up the scaler
sc = StandardScaler()
# Set up Encoders
cb = CatBoostEncoder()
oh = OneHotEncoder(sparse=True)
# Set up columns
obj = list(x.select_dtypes(include="object"))
num = list(x.select_dtypes(exclude="object"))
cb_col = [i for i in obj if len(x[i].unique())>30]
oh_col = [i for i in obj if len(x[i].unique())<10]
# First Pipeline
imp = make_pipeline((num_imp))
enc_cb = make_pipeline((cb),(obj_imp))
enc_oh = make_pipeline((oh),(obj_imp))
# Col Transformation
col = make_column_transformer((imp,num),(sc,num))
cb_ = make_column_transformer((enc_cb,cb_col))
oh_ = make_column_transformer((enc_oh,oh_col))
model = AdaBoostRegressor(random_state=(0))
run = make_pipeline((col),(cb_),(oh_),(model))
run.fit(x,y)
Any ideas as to how I can fix it? The data used can be found here if you need it. Initially I tried performing all the column transformations in one go under a single transformer variable, but that didn't work, and I was advised to separate them before running again. I did that, with the result you see above. I'd appreciate some help. Thank you!
I would not separate the column transformers like this. The way you have it, in your run pipeline the first ColumnTransformer, col, converts the input from a pandas DataFrame into a numpy array. But then the column names aren't around to be picked out by cb_ (and worse, the column order has changed, so you can't rely on the column indices from the original data).
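You can see the same failure in isolation; this is just a toy illustration (the DataFrame and column name here are made up, not from your data):

import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
ct = make_column_transformer((StandardScaler(), ["a"]))
ct.fit_transform(df)             # works: string column names on a DataFrame
ct.fit_transform(df.to_numpy())  # raises the same ValueError: string columns
                                 # are only supported for pandas DataFrames

That second call is effectively what happens to cb_ and oh_ once col has already turned x into an array.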
See my answer to another of your questions for what I think is the simplest way to build this pipeline.
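For reference, here is a rough sketch of that idea: do all the column selection once, in a single ColumnTransformer applied to the original DataFrame, and put the per-column-group pipelines inside it. It reuses your num, cb_col, oh_col, x and y from above; the imputation order and strategies are my assumptions, so adjust them to taste:

from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import AdaBoostRegressor
from category_encoders import CatBoostEncoder

# Impute before encoding/scaling within each branch (a choice, not a requirement)
num_pipe = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())
cb_pipe = make_pipeline(SimpleImputer(strategy="most_frequent"), CatBoostEncoder())
oh_pipe = make_pipeline(SimpleImputer(strategy="most_frequent"),
                        OneHotEncoder(handle_unknown="ignore"))

# One ColumnTransformer: column names are resolved against the original DataFrame
preprocess = make_column_transformer(
    (num_pipe, num),     # numeric columns
    (cb_pipe, cb_col),   # high-cardinality categoricals
    (oh_pipe, oh_col),   # low-cardinality categoricals
    remainder="drop",
)

run = make_pipeline(preprocess, AdaBoostRegressor(random_state=0))
run.fit(x, y)

Because the string column names are only ever looked up against the original pandas DataFrame, the "only supported for pandas DataFrames" error goes away, and each branch still gets its own imputer, encoder or scaler.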