I have a language classification dataset with 22,000 entries, 1,000 for each of 22 languages. Can someone please advise how I could write a classification model using simple linear regression, so that instead of one model picking one of 22 values (0, 1, 2, …, 21), there would be 22 models each picking between 1 and 0 (correct and incorrect)? What is the best way to rewrite my y target?
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
df = pd.read_csv('https://github.com/KseniaGiansar/AMA/raw/9109fd16d2cafcc25b2f5ee4373ca19dabe76a67/df_languages.csv')
# Encode the 22 language names as integers 0..21
label_encoder = preprocessing.LabelEncoder()
df['language'] = label_encoder.fit_transform(df['language'])
x = np.array(df['Text'])
y = np.array(df['language'])
# TF-IDF features from the raw text
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(x)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=3)
model = LinearRegression()
model.fit(X_train, y_train)
You are describing one-vs-rest (OvR) multiclass classification. To do this, you need to one-hot-encode the "language" column
and then iterate over the new columns, fitting one model per dummy column:
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
df = pd.read_csv('https://github.com/KseniaGiansar/AMA/raw/9109fd16d2cafcc25b2f5ee4373ca19dabe76a67/df_languages.csv')
# One-hot encoding
y = pd.get_dummies(df['language'])
print(y.columns.tolist())
x = np.array(df['Text'])
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(x)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=3)
# You'll now have 22 models, one for each language
models = []
scores = []
for language in y.columns:
    model = LinearRegression()
    model.fit(X_train, y_train[language])
    score = model.score(X_test, y_test[language])
    models.append(model)
    scores.append(score)
# Now, 'scores' is a list of R^2 scores of the models
for language, score in zip(y.columns, scores):
    print(f"Model for language '{language}' R^2 score: {score}")
This works, but I advise you to use LogisticRegression instead of LinearRegression, because LinearRegression is not bounded to the values 0 and 1. LogisticRegression will only predict 0 or 1, i.e. whether the text is in that specific language or not. Additionally, you can calculate the probability that it is that specific language.
Instead of fitting multiple models yourself, you could also wrap your classifier with OneVsRestClassifier.
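A minimal sketch of that approach, reusing X and the original string labels in df['language'] (no dummy columns needed, since OneVsRestClassifier binarizes the labels internally):
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Split on the original string labels instead of the dummy columns
X_train, X_test, y_train, y_test = train_test_split(
    X, df['language'], test_size=0.33, random_state=3)

# Fits one binary LogisticRegression per language under the hood
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X_train, y_train)

print(ovr.score(X_test, y_test))   # mean accuracy over all 22 languages
print(ovr.predict(X_test[:5]))     # predicted language names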