python csv machine-learning scikit-learn sklearn-pandas

sklearn transformers .fit(): IndexError: tuple index out of range

I'm using a "ColumnTransformer" even though I'm transforming only one feature, because I don't know how else to transform only the "clean_text" feature. I'm not using "make_column_transformer" with "make_column_selector" because I would like to run a grid search later. What I don't understand is why column 0 of the dataset can't be found.

import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

from sklearn.model_selection import train_test_split

#dataset download: https://www.kaggle.com/saurabhshahane/twitter-sentiment-dataset 

df = pd.read_csv('Twitter_Data.csv')
y = df['category']   # target
X = df['clean_text'].values.astype('U')  # feature; cast to str explicitly (in theory it already is one) because otherwise an error is raised

transformers = [
    ['text_vectorizer', CountVectorizer(), [0]],
]

ct = ColumnTransformer(transformers, remainder='passthrough')

ct.fit(X) #<---IndexError: tuple index out of range
X = ct.transform(X)

Solution

IMO there are a few points worth highlighting in this example:

CountVectorizer requires its input to be 1D. In such cases, documentation for ColumnTransformer states that

columns: str, array-like of str, int, array-like of int, array-like of bool, slice or callable

Indexes the data on its second axis. Integers are interpreted as positional columns, while strings can reference DataFrame columns by name. A scalar string or int should be used where transformer expects X to be a 1d array-like (vector), otherwise a 2d array will be passed to the transformer.

Therefore, the columns parameter should be passed as an int rather than as a list of ints. For another reference, see the question "Sklearn custom transformers with pipeline: all the input array dimensions for the concatenation axis must match exactly".
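To illustrate why the scalar matters, here is a minimal sketch on a made-up three-row frame (the `toy` data is hypothetical, not the Kaggle file). With a list selector the vectorizer is handed a one-column DataFrame, and iterating over a DataFrame yields its column labels rather than its rows:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for Twitter_Data.csv
toy = pd.DataFrame({
    "clean_text": ["good day", "bad day", "good mood"],
    "category": [1, -1, 1],
})

# Scalar selector: the vectorizer receives a 1D Series of three documents.
ct = ColumnTransformer([("text_vectorizer", CountVectorizer(), 0)])
features = ct.fit_transform(toy)
print(features.shape)

# What a list selector ([0]) hands over instead: a 2D one-column frame.
# Iterating over it yields the column label, so that is the only "document".
vec = CountVectorizer().fit(toy[["clean_text"]])
print(vec.vocabulary_)
```

In the second case the vectorizer learns a single token, the column name itself, so its output has one row instead of one row per tweet. That row-count mismatch is what breaks the concatenation with the remaining columns.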

Given that you're using a ColumnTransformer, I would pass the whole DataFrame to .fit() on the ColumnTransformer instance, rather than X only: X is a 1D array, so it has no second axis for the transformer to index.
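The error message itself hints at this. As a rough sketch (the three strings are made up, and I'm assuming the ColumnTransformer looks up the second entry of the input's shape internally; the exact exception may differ between scikit-learn versions), a 1D array has a one-element shape tuple:

```python
import numpy as np

# Same kind of object as X in the question: a 1D array of strings.
X = np.array(["good day", "bad day", "good mood"]).astype("U")
print(X.shape)

try:
    X.shape[1]  # there is no second axis on a 1D array
    message = None
except IndexError as exc:
    message = str(exc)
print(message)

# A 2D input (a DataFrame, or a reshaped array) does have a column 0.
X_2d = X.reshape(-1, 1)
print(X_2d.shape)
```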

The dataframe seems to have missing values; it might be convenient to process them somehow. For instance, by dropping them and applying what is described above I was able to make it work, but you can also decide to proceed differently.

 import pandas as pd
 import numpy as np
 from sklearn.compose import ColumnTransformer
 from sklearn.feature_extraction.text import CountVectorizer
 from sklearn.model_selection import train_test_split

 #dataset download: https://www.kaggle.com/saurabhshahane/twitter-sentiment-dataset 
 df = pd.read_csv('Twitter_Data.csv')
 y = df['category']  
 X = df['clean_text']

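 df.info()  # the non-null counts reveal the missing values

 df_n = df.dropna()  # drop rows with missing values

 transformers = [
     # scalar column index 0 ('clean_text'), so CountVectorizer gets a 1D input
     ('text_vectorizer', CountVectorizer(), 0)
 ]

 # remainder='passthrough' keeps the other column ('category') in the output
 ct = ColumnTransformer(transformers, remainder='passthrough')

 ct.fit(df_n)
 X_t = ct.transform(df_n)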
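The snippet above simply drops incomplete rows. If you'd rather proceed differently, here is a small sketch of two options on a made-up frame (the `toy` values are hypothetical; treating a missing tweet as an empty string is just one possible choice, not a recommendation):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing tweet and one missing label
toy = pd.DataFrame({
    "clean_text": ["good day", np.nan, "bad day", "fine"],
    "category": [1.0, 0.0, np.nan, -1.0],
})

# How many values are missing in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Option 1: drop every row that lacks either the text or the label
dropped = toy.dropna(subset=["clean_text", "category"])

# Option 2: drop rows without a label, but keep rows without text
# by replacing the missing text with an empty string
filled = toy.dropna(subset=["category"]).assign(
    clean_text=lambda d: d["clean_text"].fillna("")
)
print(len(dropped), len(filled))
```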

As noted in the comments, transformers should be specified as a list of tuples (as per the documentation) rather than as a list of lists. However, running the snippet above with your list-of-lists specification also seems to work, and I've observed that substituting lists for tuples elsewhere (in unrelated code of mine) doesn't seem to raise issues either. Still, in my experience it is far more common to see them passed as a list of tuples.
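If you want to check this on your own installation, a quick comparison along these lines should do (the `toy` frame and the `vectorize` helper are hypothetical, and since the documentation only promises tuples, the list variant may not be accepted by every scikit-learn version):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for the tweets
toy = pd.DataFrame({"clean_text": ["good day", "bad day", "good mood"]})

def vectorize(transformers):
    """Fit a ColumnTransformer on toy and return its output as a dense array."""
    out = ColumnTransformer(transformers).fit_transform(toy)
    return out.toarray() if hasattr(out, "toarray") else np.asarray(out)

as_tuples = vectorize([("text_vectorizer", CountVectorizer(), 0)])  # documented form
as_lists = vectorize([["text_vectorizer", CountVectorizer(), 0]])   # form from the question

print(np.array_equal(as_tuples, as_lists))
```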