scikit-learn · data-science · data-preprocessing

Why is ColumnTransformer producing a different output using the same code but different .csv files?


I am trying to finish this course tooth and nail, with the hope of being able to do this kind of work at entry level by springtime. This is my first post on this incredible resource, and I will do my best to conform to the posting format. As a way to reinforce my learning and commit it to long-term memory, I'm trying the same steps on my own dataset of >500 entries, containing data more relevant to me than the course's dummy data.

I'm learning about the data preprocessing phase, where you fill in missing values and separate the columns into their respective X and y to be fed into the models later on, if I understand correctly.
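For anyone following along, here is a minimal sketch of the missing-value part of that step using scikit-learn's SimpleImputer. The column indices 1:3 are only an illustrative placeholder for whichever numeric columns contain NaNs, not taken from my actual file:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.read_csv('datasets/winpredictor.csv')
X = df.iloc[:, :-1].values   # all columns except the last (features)
y = df.iloc[:, -1].values    # last column (target)

# Fill missing numeric values with the column mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X[:, 1:3] = imputer.fit_transform(X[:, 1:3])   # placeholder numeric columns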

So in the course example, it's the top-left dataset of countries. The bottom left is my own dataset, which I've been building for about a year from a multiplayer game I play. It has 100 or so characters you can choose from, who are played across 5 different categorical roles.

[Screenshots: course dataset (top left), personal dataset (bottom left), and the column-transformed results of the personal dataset]

What's up with the different outputs, when the only difference is the dataset (.csv file)? The course's dataset looks right: the first column of countries (textual categories) gets turned into binary vectors in the output, no? Why is the output on my dataset omitting columns and producing these bizarre-looking tuples, each followed by what looks like a random number? I've tried removing the np.array function and printing the output at each step, but I'm unable to see what's causing the difference. I expected that, on my dataset, it would transform the characters' names into binary vectors (combinations of 1s and 0s) so the computer can tell them apart and map them to the appropriate results. Instead I'm getting this weird-looking output I've never seen before.

EDIT: It turns out these bizarre number combinations are what's called a "sparse matrix." I had to do some research, starting with type(), which yielded csr_array. If I understood what I read correctly, everything inside is packed into a compressed form, so I tried selecting all rows/columns using [:] and didn't get an error.
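For anyone else who runs into this, here is a minimal sketch of how I inspected it (assuming X is the result of ct.fit_transform below; the printed class name may differ slightly between SciPy/scikit-learn versions):

import scipy.sparse as sp

print(type(X))            # e.g. a SciPy csr_array / csr_matrix
print(sp.issparse(X))     # True -- it's a sparse matrix, not a NumPy array
print(X.shape)            # (n_rows, one-hot columns + passthrough columns)

# Convert to a regular dense NumPy array to see the familiar 0/1 columns
X_dense = X.toarray()
print(X_dense[0])         # first row as ordinary numbers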

Really appreciate your time and assistance.

EDIT: Thanks to this thread I was able to make my way to the end of this data preprocessing/import/cleaning exercise, through feature scaling, using my own dataset of ~550 rows.

import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split

# IMPORT RAW DATA // ASSIGN X AND Y RAW
df = pd.read_csv('datasets/winpredictor.csv')

X = df.iloc[:, :-1].values   # all columns except the last (features)
y = df.iloc[:, -1].values    # last column (target)

# TRANSFORM CATEGORICAL DATA
# One-hot encode the first two columns; pass the rest through unchanged.
ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), [0, 1])],
    remainder='passthrough')
le = LabelEncoder()

X = ct.fit_transform(X)      # returns a sparse matrix with many categories
y = le.fit_transform(y)

# SPLIT THE DATA INTO TRAINING AND TEST SETS
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, test_size=0.2, random_state=1)

# FEATURE SCALING
# with_mean=False because centering is not supported on sparse input
sc = StandardScaler(with_mean=False)

X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
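If you want the dense, course-style output end to end instead of a sparse matrix, one option (just a sketch of an alternative, not the only way) is to set sparse_threshold=0 on the ColumnTransformer so it always stacks the result as a dense array; newer scikit-learn versions also accept OneHotEncoder(sparse_output=False). With a dense array, StandardScaler can center as usual:

# Alternative: ask ColumnTransformer for a dense array from the start
ct_dense = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), [0, 1])],
    remainder='passthrough',
    sparse_threshold=0)             # 0 = never return a sparse matrix

X = ct_dense.fit_transform(df.iloc[:, :-1].values)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, test_size=0.2, random_state=1)

sc = StandardScaler()               # centering is fine on a dense array
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

Whether the 0/1 one-hot columns should be standardized at all is a separate design choice; some people scale only the numeric passthrough columns.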

Solution

  • First of all, I encourage you to keep working through this course, and you will surely be a solid data scientist in a few weeks.

    Let's talk about your problem. It seems you only have a visualization issue, caused by the large number of distinct "Hero" values (I think you have 37 unique ones).

    Let me explain the results you have printed. The program only shows the values in each sample that are different from 0:

    • (0, 10) = 1 --> 0 refers to the first sample, and 10 refers to the value in column 10 of that sample, which is equal to 1.

    • (0, 37) = 5 --> 0 refers to the first sample, and 37 refers to the value in column 37, which is equal to 5.

    and so on.

    So your first sample will be something like:

    [0,0,0,0,0,0,0,0,0,0,1,.........., 5, 980,-30, 1000, 6023]
    

    which is the encoded form of the first "Jakiro" sample:

    ["Jakiro",5, 980,-30, 1000, 6023]
    

    To sum up, the first 37 values come from your OneHotEncoder, and the last 5 are your original numerical values.

    So the result seems to be correct; it is just displayed differently because of the large number of categories in your categorical variable.

    You can try reducing the number of rows of X (to 4, for example) and running the same process; then you will get output similar to the course's. The small sketch after this answer shows how those (row, column) = value entries map back to an ordinary array.
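    As an illustration of that output format, here is a small sketch with a made-up 3-sample array (not your actual data) showing how SciPy prints a sparse matrix and how .toarray() recovers the dense form:

    import numpy as np
    from scipy.sparse import csr_matrix

    # Hypothetical tiny example: 3 one-hot columns plus 2 numeric columns
    dense = np.array([[0, 1, 0, 5, 980],
                      [1, 0, 0, 3, -30],
                      [0, 0, 1, 7, 1000]])

    sparse = csr_matrix(dense)
    print(sparse)            # lists (row, column)  value  for each non-zero entry,
                             # e.g. (0, 1) 1 / (0, 3) 5 / (0, 4) 980 ...
    print(sparse.toarray())  # recovers the ordinary dense layout above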