I recently started learning Python for machine learning and ran into a problem. I am reading data from a .csv file with pandas and transforming the values in each column into arrays of numbers. I need to pass those arrays to an sklearn function. My code is here:
# Imports
import pandas as pd
import numpy as np
import sklearn
import os
import seaborn as seabornInstance
import matplotlib.pyplot as plt
from sklearn import preprocessing
# Dataset input
dirname = os.path.dirname(__file__)
filename = os.path.join(dirname, 'Datasets', 'car.data')
data = pd.read_csv(filename)
label_encoder = preprocessing.LabelEncoder()
# Transforming text
x = [label_encoder.fit_transform(list(data[col])) for col in data.columns if col!='class']
y = [label_encoder.fit_transform(list(data['class']))]
The problem arises here. I need to access the nested arrays inside x and pass them to my sklearn function, because passing x itself throws an error:
ValueError: Found input variables with inconsistent numbers of samples: [6, 1]
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x[0], y[0], test_size = 0.08)
Is there a way to pass all of the nested arrays as x, rather than the whole x, without writing x[0],x[1],..? I do not think a loop would work, since I need to pass everything at once, or am I wrong?
EDIT: The data I am importing is not numeric (it consists of strings), which is why I am using label_encoder to transform the values into numeric ones for use in a KNN algorithm.
I think you're making it unnecessarily complicated. You shouldn't label encode your x variable either. As the name says, it's for the label, not the predictor variables. For your x variable, you should use one of these lines:
# df is the DataFrame returned by pd.read_csv (named data in the question)
x = df.loc[:, [i for i in df.columns if i != 'class']]
# or
x = df.drop('class', axis=1)
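To see why this avoids the ValueError from the question, it helps to compare shapes. The sketch below uses a small made-up DataFrame (the column names and values are placeholders, not the real car data). A list holding one array per column has its length equal to the number of columns, and sklearn treats that length as the number of samples, whereas the DataFrame keeps one row per sample.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the car data: 4 rows, 2 feature columns + 'class'
df = pd.DataFrame({
    "buying": ["vhigh", "high", "med", "low"],
    "doors": ["2", "3", "4", "5more"],
    "class": ["unacc", "acc", "good", "vgood"],
})

# One array per column, as built by the list comprehension in the question
per_column = [df[col].to_numpy() for col in df.columns if col != "class"]
n_outer = len(per_column)                      # number of columns, not rows
per_column_shape = np.asarray(per_column).shape  # (n_features, n_samples)

# Dropping the target column keeps the (n_samples, n_features) layout
x = df.drop("class", axis=1)
x_shape = x.shape
```

Here the list has length 2 (one entry per feature column), while x has 4 rows, which is the layout train_test_split expects.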
And for your y variable:
y = label_encoder.fit_transform(df['class'])
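Since the question mentions that the feature columns are strings and the goal is KNN, the features still need to become numeric before fitting. One possible way to do that, sketched here as an assumption rather than as part of the original answer, is sklearn's OrdinalEncoder, which encodes a whole 2-D table of categorical features at once. The DataFrame contents, column names, test_size and random_state below are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical stand-in for the car data (8 rows, 2 feature columns)
df = pd.DataFrame({
    "buying": ["vhigh", "high", "med", "low", "vhigh", "high", "med", "low"],
    "safety": ["low", "med", "high", "low", "med", "high", "low", "med"],
    "class": ["unacc", "acc", "good", "unacc", "acc", "good", "unacc", "acc"],
})

# Features stay in a 2-D (n_samples, n_features) table
x = df.drop("class", axis=1)

# LabelEncoder is used only for the target column
y = LabelEncoder().fit_transform(df["class"])

# OrdinalEncoder maps each feature column's categories to numbers
x_encoded = OrdinalEncoder().fit_transform(x)

# x_encoded and y now have the same number of samples, so the split works
x_train, x_test, y_train, y_test = train_test_split(
    x_encoded, y, test_size=0.25, random_state=0
)
```

Note that OrdinalEncoder assigns numbers in sorted category order, which implies an ordering that may not be meaningful for a distance-based model such as KNN; OneHotEncoder is a common alternative when the categories have no natural order.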