I'm using a "ColumnTransformer" even though I'm transforming only one feature, because I don't know how else to transform only the "clean_text" feature. I'm not using "make_column_transformer" with "make_column_selector" because I'd like to run a grid search later. However, I don't understand why column 0 of the dataset can't be found.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
#dataset download: https://www.kaggle.com/saurabhshahane/twitter-sentiment-dataset
df = pd.read_csv('Twitter_Data.csv')
y = df['category']  # target
X = df['clean_text'].values.astype('U')  # feature; cast to str because an error was raised otherwise, even though the column should already be text
transformers = [
['text_vectorizer', CountVectorizer(), [0]],
]
ct = ColumnTransformer(transformers, remainder='passthrough')
ct.fit(X) #<---IndexError: tuple index out of range
X = ct.transform(X)
In my opinion, there are a few points worth highlighting in this example:
CountVectorizer requires its input to be 1D. For such cases, the documentation for ColumnTransformer states:
columns: str, array-like of str, int, array-like of int, array-like of bool, slice or callable
Indexes the data on its second axis. Integers are interpreted as positional columns, while strings can reference DataFrame columns by name. A scalar string or int should be used where transformer expects X to be a 1d array-like (vector), otherwise a 2d array will be passed to the transformer.
Therefore, the columns parameter should be passed as an int rather than as a list of int. For another reference, see also "Sklearn custom transformers with pipeline: all the input array dimensions for the concatenation axis must match exactly".
Since you're using a column transformer, I would pass the whole dataframe to the .fit() method of the ColumnTransformer instance, rather than X only. X is a 1D array with no second axis to index, which is consistent with the IndexError you are seeing.
The dataframe seems to have missing values, so it is worth handling them somehow. For instance, by dropping them and applying what is described above I was able to make it work, but you can also decide to proceed differently.
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
#dataset download: https://www.kaggle.com/saurabhshahane/twitter-sentiment-dataset
df = pd.read_csv('Twitter_Data.csv')
y = df['category']
X = df['clean_text']
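df.info()  # inspect the non-null counts per column
df_n = df.dropna()  # drop the rows with missing values
transformers = [
    # scalar column index, so a 1D column is passed to CountVectorizer
    ('text_vectorizer', CountVectorizer(), 0)
]
ct = ColumnTransformer(transformers, remainder='passthrough')
ct.fit(df_n)  # fit on the whole dataframe rather than on X only
ct.transform(df_n)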
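To make the 1D versus 2D point concrete without downloading the dataset, here is a minimal sketch on a made-up three-row dataframe (the `toy` data is hypothetical and only mirrors the column names of the real file). It compares what CountVectorizer produces when it receives a single column as a 1D Series versus a one-column 2D dataframe, and then runs the ColumnTransformer with a scalar column index.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data that only mimics the layout of Twitter_Data.csv
toy = pd.DataFrame({
    "clean_text": ["good morning", "bad weather", "good weather"],
    "category": [1, -1, 1],
})

# 1D input (a Series): each row is treated as one document
one_d = CountVectorizer().fit_transform(toy["clean_text"])

# 2D input (a one-column DataFrame): iterating a DataFrame yields its
# column labels, so the only "document" seen is the string "clean_text"
two_d = CountVectorizer().fit_transform(toy[["clean_text"]])

# A scalar column index makes ColumnTransformer hand over the 1D column
ct = ColumnTransformer(
    [("text_vectorizer", CountVectorizer(), 0)],
    remainder="passthrough",
)
out = ct.fit_transform(toy)

print(one_d.shape, two_d.shape, out.shape)
```

With the 2D input the vectorizer returns a single row, which no longer lines up with the rows of the passed-through columns; this is the kind of mismatch the linked question is about.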
As noted in the comments, transformers should be specified as a list of tuples (as per the documentation) rather than as a list of lists. That said, running the snippet above with your transformers specification seems to work, and I've observed that substituting lists for tuples elsewhere (in unrelated code of mine) does not seem to raise issues either. In my experience, though, it is far more common to see them passed as a list of tuples.
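A quick way to check the tuples versus lists observation yourself is sketched below, again on a hypothetical toy dataframe. It only reflects the behaviour I would expect from the scikit-learn versions I am aware of; since the documentation asks for tuples, I would not rely on lists being accepted everywhere.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data, not the Kaggle dataset
toy_df = pd.DataFrame({
    "clean_text": ["good morning", "bad weather", "good weather"],
    "category": [1, -1, 1],
})

# Documented form: a list of (name, transformer, columns) tuples
as_tuples = ColumnTransformer(
    [("text_vectorizer", CountVectorizer(), 0)],
    remainder="passthrough",
)

# Undocumented form from the question: a list of lists
as_lists = ColumnTransformer(
    [["text_vectorizer", CountVectorizer(), 0]],
    remainder="passthrough",
)

shape_tuples = as_tuples.fit_transform(toy_df).shape
shape_lists = as_lists.fit_transform(toy_df).shape
print(shape_tuples, shape_lists)
```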