python-3.x pandas csv scikit-learn logistic-regression

Keeping or not keeping header in the CSV for training


Is it always required to remove the header from an imported CSV for training?

This is what I have...

raw_data_df = [pd.read_csv(
            file, header=None, skiprows=[0], low_memory=False) for file in input_files]
train_data_df = pd.concat(raw_data_df)

I use header=None and skiprows=[0] to skip the header row, and then pass the resulting data to LogisticRegression().fit().

Or is it better to keep the header?
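
For reference, here is a minimal sketch of what the two ways of reading produce. The CSV content and the column names (age, income, label) are made up for illustration and are not from my real files.

```python
import io
import pandas as pd

# Hypothetical CSV held in memory so the sketch is self-contained.
csv_text = "age,income,label\n25,40000,0\n47,82000,1\n"

# Approach from the question: skip the header row and assign no column names.
no_header = pd.read_csv(io.StringIO(csv_text), header=None, skiprows=[0])

# Default behaviour: the first row is used as the column names.
with_header = pd.read_csv(io.StringIO(csv_text))

print(list(no_header.columns))    # integer positions
print(list(with_header.columns))  # names taken from the header row

# The numeric values are the same either way; only the labels differ.
same_values = bool((no_header.to_numpy() == with_header.to_numpy()).all())
print(same_values)
```

So the model sees the same numbers in both cases; the difference is only whether the columns carry names.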


Solution

  • No, removing the header is not required. If the headers in all files are identical, you can keep them. Alternatively, keep only the header of the first file.

    The advantage of keeping the header is that, after fitting the logistic regression, you can easily see which coefficient belongs to which column name (and so, provided the features are on comparable scales, which features have the largest effect).

    For example:

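    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    
    # X_train is assumed to be a DataFrame that kept its column names,
    # and y_train a binary target.
    lr = LogisticRegression()
    lr.fit(X_train, y_train)
    
    # For a binary target, lr.coef_ has shape (1, n_features), so take row 0.
    # lr.classes_ holds the target labels, not the feature names.
    df_lr_coef = pd.DataFrame({
        'features': X_train.columns,
        'coefficients': lr.coef_[0],
        'coef_abs': np.abs(lr.coef_[0]),
    }).sort_values(by='coef_abs', ascending=False)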
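
To tie this back to the question, here is a self-contained sketch that reads several files with their headers kept, concatenates them, and builds the coefficient table. The two in-memory CSV strings, the column names (age, income, label) and the choice of max_iter=1000 are assumptions made for illustration; in practice you would loop over input_files instead.

```python
import io
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Two hypothetical CSV "files" with identical headers, held in memory.
file_a = "age,income,label\n25,40,0\n31,52,0\n47,82,1\n52,95,1\n"
file_b = "age,income,label\n29,45,0\n38,60,0\n55,99,1\n61,88,1\n"

# Keep the header in every file; pd.concat aligns the columns by name.
frames = [pd.read_csv(io.StringIO(text)) for text in (file_a, file_b)]
train_data_df = pd.concat(frames, ignore_index=True)

# Split into features and target using the column names from the header.
X_train = train_data_df.drop(columns="label")
y_train = train_data_df["label"]

lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)

# Binary target: coef_ has shape (1, n_features), so row 0 holds the weights.
df_lr_coef = pd.DataFrame({
    "features": X_train.columns,
    "coefficients": lr.coef_[0],
    "coef_abs": np.abs(lr.coef_[0]),
}).sort_values(by="coef_abs", ascending=False)

print(df_lr_coef)
```

Because the columns are named, selecting the target with train_data_df["label"] also protects you if the column order differs between files, which positional indexing would not.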