python machine-learning scikit-learn regression prediction

Trying to understand an example script on ML

I'm trying to work through an example script on machine learning, "Common pitfalls in interpretation of coefficients of linear models", but I'm having trouble understanding some of the steps. The beginning of the script looks like this:

import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.datasets import fetch_openml

survey = fetch_openml(data_id=534, as_frame=True)

# We identify features `X` and targets `y`: the column WAGE is our
# target variable (i.e., the variable which we want to predict).
X = survey.data[survey.feature_names]
X.describe(include="all")

X.head()

# Our target for prediction is the wage.
y = survey.target.values.ravel()
survey.target.head()

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

train_dataset = X_train.copy()
train_dataset.insert(0, "WAGE", y_train)
_ = sns.pairplot(train_dataset, kind='reg', diag_kind='kde')

My problem is in the lines

y = survey.target.values.ravel()
survey.target.head()

If we examine survey.target.head() immediately after these lines, the output is

Out[36]: 
0    5.10
1    4.95
2    6.67
3    4.00
4    7.50
Name: WAGE, dtype: float64

How does the model know that WAGE is the target variable? Does it not have to be explicitly declared?

Solution

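The model does not know that WAGE is the target. fetch_openml returns the column that the dataset's OpenML metadata marks as its default target (the target_column parameter defaults to 'default-target'), and it only becomes the target of a model because it is later passed as y. The call survey.target.values.ravel() is meant to flatten the array, but in this example it is not necessary: survey.target is a pandas Series (a single labelled column) and survey.target.values is the underlying one-dimensional numpy array. You can pass either one to train_test_split, since survey.target holds only one column.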

type(survey.target)
pandas.core.series.Series

type(survey.target.values)
numpy.ndarray
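
As a rough illustration of why ravel() changes nothing here, the sketch below builds a small Series by hand from the five WAGE values shown in the question. The variable `target` is only a stand-in for survey.target, not the real dataset.

```python
import numpy as np
import pandas as pd

# Stand-in for survey.target: the first five WAGE values from the question.
target = pd.Series([5.10, 4.95, 6.67, 4.00, 7.50], name="WAGE")

values = target.values   # 1-D numpy array
flat = values.ravel()    # already 1-D, so the shape does not change

print(type(target).__name__, target.shape)
print(type(values).__name__, values.shape)
print(flat.shape, np.array_equal(values, flat))

# ravel() only matters for a 2-D column, e.g. the values of a
# one-column DataFrame.
column = target.to_frame().values
print(column.shape, column.ravel().shape)
```

So ravel() would only be needed if the target had been stored as a one-column, two-dimensional array.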

If you use survey.target directly, the rest of the script still works:

y = survey.target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

train_dataset = X_train.copy()
train_dataset.insert(0, "WAGE", y_train)
sns.pairplot(train_dataset, kind='reg', diag_kind='kde')
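
One detail worth checking when switching between the two forms is how insert() lines the values up with the rows. The sketch below uses a small made-up frame (the AGE, EDUCATION and WAGE numbers are hypothetical toy data, not the survey) to suggest that a Series is matched by index while an array is matched by position, and that both give the same result after the same split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the survey features and wages.
X = pd.DataFrame({
    "AGE": [21, 25, 30, 34, 40, 45, 52, 60],
    "EDUCATION": [12, 16, 12, 14, 18, 10, 12, 16],
})
y = pd.Series([5.0, 9.5, 7.0, 8.5, 14.0, 6.0, 7.5, 12.0], name="WAGE")

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# A Series keeps its row labels through the split, so they match X_train.
print(y_train.index.equals(X_train.index))

train_dataset = X_train.copy()
train_dataset.insert(0, "WAGE", y_train)   # aligned by index label

# With a numpy array the values are taken by position instead; the same
# random_state gives the same shuffle, so the result is identical.
_, _, y_train_arr, _ = train_test_split(X, y.values, random_state=42)
same = bool((train_dataset["WAGE"].values == y_train_arr).all())
print(same)
```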

With another dataset, for example iris, suppose you want to regress petal width on the other measurements. You select the feature and target columns from the DataFrame yourself, using square brackets []:

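import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

dat = load_iris(as_frame=True).frame

# Features: the three other measurements. Target: petal width.
X = dat[['sepal length (cm)','sepal width (cm)','petal length (cm)']]
y = dat[['petal width (cm)']]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The model treats whatever is passed as the second argument as the target.
LR = LinearRegression()
LR.fit(X_train,y_train)

# Observed versus predicted petal width on the held-out rows.
plt.scatter(x=y_test,y=LR.predict(X_test))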
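
To make the point about "declaring" the target concrete, here is a minimal sketch of the same iris regression in which the target is chosen by name. The variable `target_name` is my own naming for illustration; the estimator never sees it and learns the target only through the second argument of fit().

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

dat = load_iris(as_frame=True).frame

# Our own choice of target; the estimator has no notion of column names.
target_name = "petal width (cm)"

# Drop the chosen target and the species label ("target") from the features.
X = dat.drop(columns=[target_name, "target"])
y = dat[target_name]   # single brackets give a Series, so predictions are 1-D

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LinearRegression().fit(X_train, y_train)
predictions = model.predict(X_test)

print(list(X.columns))
print(model.coef_.shape, predictions.shape)
```

Changing `target_name` to another numeric column would make that column the target instead, without any other change to the model.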