python-3.x pandas csv scikit-learn logistic-regression

Keeping or not keeping header in the CSV for training

Is it always required to remove the header from an imported CSV for training?

This is what I have...

raw_data_df = [pd.read_csv(
            file, header=None, skiprows=[0], low_memory=False) for file in input_files]
train_data_df = pd.concat(raw_data_df)

We use header=None and skiprows=[0] to drop the header row, and then pass the resulting data to LogisticRegression().fit().

Or is it better to keep the header?

Solution

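No, it is not required. If the headers are identical in all files, you can keep them: with the default header=0, pd.read_csv uses the first row of each file as the column names rather than as data, and pd.concat aligns the frames on those names. Alternatively, keep only the header of the first file and reuse its column names for the others.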
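As a rough sketch of the first option, the snippet below reads two files with the header kept and concatenates them. The in-memory "files" and the column names age, income and label are made up for illustration; in your case input_files would be your real paths.

```python
import io
import pandas as pd

# Hypothetical stand-ins for two CSV files that share the same header.
file_a = io.StringIO("age,income,label\n25,40000,0\n52,88000,1\n")
file_b = io.StringIO("age,income,label\n37,61000,1\n19,15000,0\n")
input_files = [file_a, file_b]

# header=0 (the default) takes the first row of each file as column names,
# so the header never ends up among the data rows.
raw_data_df = [pd.read_csv(f, header=0) for f in input_files]

# concat aligns the frames on column names; ignore_index renumbers the rows.
train_data_df = pd.concat(raw_data_df, ignore_index=True)

# Split into features and target ("label" is an assumed column name).
X_train = train_data_df.drop(columns="label")
y_train = train_data_df["label"]
```

X_train here is still a DataFrame, so it carries the column names along and can be passed to fit() directly.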

The advantage of having a header is that after fitting the logistic regression you can easily see which coefficient belongs to which column name (and so, provided the features are on comparable scales, which columns have the largest coefficients).

For example:

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

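# X_train is assumed to be a DataFrame that kept its header (column names),
# and y_train the matching binary target.
lr = LogisticRegression()
lr.fit(X_train, y_train)

# For a binary target, lr.coef_ has shape (1, n_features), so take row 0.
# Note that lr.classes_ holds the class labels, not the feature names.
df_lr_coef = pd.DataFrame({
    'features': X_train.columns,
    'coefficients': lr.coef_[0],
    'coef_abs': np.abs(lr.coef_[0]),
}).sort_values(by='coef_abs', ascending=False)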