Is it always required to remove the header from an imported CSV for training?
This is what I have...
raw_data_df = [
    pd.read_csv(file, header=None, skiprows=[0], low_memory=False)
    for file in input_files
]
train_data_df = pd.concat(raw_data_df)
We used header=None and skiprows=[0] to skip the header, and we pass the resulting data to LogisticRegression().fit().
Or is it better to keep the header?
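No, it is not required. By default pd.read_csv treats the first row as column names rather than as data, so the header never reaches the model as a training row. If the header is identical in every file, you can keep it in all of them. Alternatively, keep only the header of the first file.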
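As a minimal sketch of the first option, assuming every file shares the same header: the in-memory CSV strings and the column names age, income and label below are made up for illustration and stand in for your real input_files.

```python
import io

import pandas as pd

# Two in-memory "files" standing in for input_files (hypothetical data);
# both start with the same header row.
csv_a = "age,income,label\n25,40000,0\n47,82000,1\n"
csv_b = "age,income,label\n33,51000,0\n58,97000,1\n"
input_files = [io.StringIO(csv_a), io.StringIO(csv_b)]

# header=0 treats the first row of each file as column names, not as data.
frames = [pd.read_csv(f, header=0) for f in input_files]

# concat aligns the frames on column names; ignore_index builds a fresh index.
train_data_df = pd.concat(frames, ignore_index=True)

# Split into named feature columns and the target ("label" is an assumed name).
X_train = train_data_df.drop(columns=["label"])
y_train = train_data_df["label"]

print(list(X_train.columns))
print(train_data_df.shape)
```

Because X_train is still a DataFrame here, the column names stay attached to the features and can be looked up again after fitting.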
The advantage of keeping the header is that after you fit the logistic regression, you can easily see which coefficient belongs to which column name (and so which features carry the largest coefficients).
For example:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train, y_train)
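# X_train must be a DataFrame so that its columns carry the header names.
# For a binary target, lr.coef_ has shape (1, n_features), so take row 0.
# (lr.classes_ holds the target class labels, not the feature names.)
df_lr_coef = pd.DataFrame({
    'features': X_train.columns,
    'coefficients': lr.coef_[0],
    'coef_abs': np.abs(lr.coef_[0]),
}).sort_values(by='coef_abs', ascending=False)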