python scikit-learn linear-regression

Using simple linear regression for a multiclass classification task


I have a language classification dataset with 22,000 entries, 1,000 for each of 22 languages. Can someone please advise how I could write a classification model using simple linear regression, so that instead of one model picking one of the values 0, 1, 2, …, 21, there would be 22 models, each choosing between 1 and 0 (correct and incorrect)? How should I rewrite my y target?

import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv('https://github.com/KseniaGiansar/AMA/raw/9109fd16d2cafcc25b2f5ee4373ca19dabe76a67/df_languages.csv')

# Encode the language names as integers
label_encoder = preprocessing.LabelEncoder()
df['language'] = label_encoder.fit_transform(df['language'])

x = np.array(df['Text'])
y = np.array(df['language'])

# Turn the raw text into TF-IDF features
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(x)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=3)

# A single regression model fitted on the integer labels
model = LinearRegression()
model.fit(X_train, y_train)

Solution

  • You are describing one-vs-rest (OvR) multiclass classification. To do this, you need to one-hot encode the "language" column and then iterate over the new columns, fitting one model for each dummy column:

    import numpy as np 
    import pandas as pd
    from sklearn import preprocessing
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split
    
    df = pd.read_csv('https://github.com/KseniaGiansar/AMA/raw/9109fd16d2cafcc25b2f5ee4373ca19dabe76a67/df_languages.csv')
    
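    # One-hot encoding: one 0/1 column per language
    # (dtype=int keeps the targets numeric; newer pandas versions return booleans by default)
    y = pd.get_dummies(df['language'], dtype=int)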
    
    print(y.columns.tolist())
    
    x = np.array(df['Text'])
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(x)
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=3)
    
    # You'll now have 22 models, one for each language
    models = []
    scores = []
    for language in y.columns:
        model = LinearRegression()
        model.fit(X_train, y_train[language])
        score = model.score(X_test, y_test[language])
        models.append(model)
        scores.append(score)
    
    # Now, 'scores' is a list of R^2 scores of the models
    for language, score in zip(y.columns, scores):
        print(f"Model for language '{language}' R^2 score: {score}")
    

    This works, but I advise you to use LogisticRegression instead of LinearRegression, because the predictions of LinearRegression are not bounded to the range between 0 and 1.
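
To make the point about unbounded predictions concrete, here is a minimal sketch on made-up toy data (one feature, six points; none of it comes from the language dataset). It only illustrates that a linear regression fitted on a 0/1 target can predict values below 0 and above 1, while the probabilities from LogisticRegression stay inside that range.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy data, invented for illustration: one feature and a binary 0/1 target.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

linear = LinearRegression().fit(X_toy, y_toy)
logistic = LogisticRegression().fit(X_toy, y_toy)

# Points at the edges of the training range and one well beyond it.
X_new = np.array([[0.0], [5.0], [10.0]])

linear_pred = linear.predict(X_new)                    # unbounded real numbers
logistic_proba = logistic.predict_proba(X_new)[:, 1]   # P(y = 1), always within [0, 1]
logistic_label = logistic.predict(X_new)               # hard labels, 0 or 1

print("LinearRegression predictions:", linear_pred)
print("LogisticRegression P(y=1):   ", logistic_proba)
print("LogisticRegression labels:   ", logistic_label)
```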

    LogisticRegression predicts only the labels 0 and 1, i.e. whether the text is in that specific language or not. Additionally, you can get the probability that the text is in that language by calling predict_proba.
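
As a rough sketch of that idea, the loop from the code above can be repeated with LogisticRegression, and the per-language probabilities can then be compared to pick a single language. The six-sentence corpus below is invented so the snippet runs without downloading anything; with the real data you would use df['Text'] and df['language'] and keep the train/test split. Choosing the language by argmax over the probabilities is one common convention, not something the original code does.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny made-up corpus standing in for df_languages.csv (illustration only).
toy = pd.DataFrame({
    "Text": [
        "the cat sits on the mat",
        "where is the train station",
        "el gato duerme en la casa",
        "donde esta la estacion de tren",
        "le chat dort dans la maison",
        "ou est la gare s'il vous plait",
    ],
    "language": ["English", "English", "Spanish", "Spanish", "French", "French"],
})

X = TfidfVectorizer().fit_transform(toy["Text"])
y = pd.get_dummies(toy["language"], dtype=int)  # one 0/1 column per language

# One binary LogisticRegression per language.
models = {}
for language in y.columns:
    models[language] = LogisticRegression().fit(X, y[language])

# Column j holds P(text is in language j) according to model j.
proba = np.column_stack(
    [models[language].predict_proba(X)[:, 1] for language in y.columns]
)

# Pick, for each text, the language whose model gives the highest probability.
predicted = y.columns[proba.argmax(axis=1)]

print(pd.DataFrame(proba, columns=y.columns).round(2))
print(list(predicted))
```

The snippet predicts on the same texts it was trained on, which is only meant to show the mechanics; any real evaluation should use held-out data such as X_test.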

    Instead of fitting multiple models yourself, you could wrap your classifier with OneVsRestClassifier, which fits one binary classifier per class for you.
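
A minimal sketch of the OneVsRestClassifier approach, again on an invented six-sentence corpus so it runs standalone (the texts and labels are placeholders, not the real dataset). The wrapper accepts the original string labels directly, so no manual one-hot encoding is needed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Made-up stand-in for the 'Text' and 'language' columns (illustration only).
texts = [
    "the cat sits on the mat",
    "where is the train station",
    "el gato duerme en la casa",
    "donde esta la estacion de tren",
    "le chat dort dans la maison",
    "ou est la gare s'il vous plait",
]
labels = ["English", "English", "Spanish", "Spanish", "French", "French"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# Fits one binary LogisticRegression per language behind a single interface.
ovr = OneVsRestClassifier(LogisticRegression())
ovr.fit(X, labels)

print(ovr.classes_)           # the languages, in sorted order
print(len(ovr.estimators_))   # one fitted binary model per language

# New text must go through the same fitted vectorizer before predicting.
new_X = vectorizer.transform(["la casa del gato"])
print(ovr.predict(new_X))
print(ovr.predict_proba(new_X).round(2))
```

With the real data, you would fit on X_train and the training labels, then call ovr.score(X_test, y_test) to get classification accuracy rather than an R^2 value.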