I'm trying to work through an example script on machine learning: Common pitfalls in interpretation of coefficients of linear models but I'm having trouble understanding some of the steps. The beginning of the script looks like this:
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_openml
survey = fetch_openml(data_id=534, as_frame=True)
# We identify features `X` and targets `y`: the column WAGE is our
# target variable (i.e., the variable which we want to predict).
X = survey.data[survey.feature_names]
X.describe(include="all")
X.head()
# Our target for prediction is the wage.
y = survey.target.values.ravel()
survey.target.head()
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
train_dataset = X_train.copy()
train_dataset.insert(0, "WAGE", y_train)
_ = sns.pairplot(train_dataset, kind='reg', diag_kind='kde')
My problem is in the lines
y = survey.target.values.ravel()
survey.target.head()
If we examine survey.target.head()
immediately after these lines, the output is
Out[36]:
0 5.10
1 4.95
2 6.67
3 4.00
4 7.50
Name: WAGE, dtype: float64
How does the model know that WAGE
is the target variable? Does is not have to be explicitly declared?
The line survey.target.values.ravel()
is meant to flatten the array, but in this example it is not necessary. survey.target is a pd Series (i.e 1 column data frame) and survey.target.values is a numpy array. You can use both for train/test split since there is only 1 column in survey.target
.
type(survey.target)
pandas.core.series.Series
type(survey.target.values)
numpy.ndarray
If we use just survey.target, you can see that the regression will work:
y = survey.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
train_dataset = X_train.copy()
train_dataset.insert(0, "WAGE", y_train)
sns.pairplot(train_dataset, kind='reg', diag_kind='kde')
If you have another dataset, for example iris, I want to regress petal width against the rest. You would call the column of the data.frame using the square brackets []
:
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
dat = load_iris(as_frame=True).frame
X = dat[['sepal length (cm)','sepal width (cm)','petal length (cm)']]
y = dat[['petal width (cm)']]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
LR = LinearRegression()
LR.fit(X_train,y_train)
plt.scatter(x=y_test,y=LR.predict(X_test))