Tags: python-3.x, csv, sklearn-pandas, keyerror

Why does the error 'The above exception was the direct cause of the following exception:' come up in Python?


I am trying to process my CSV with nlargest and have run into this error. What could be causing it? I've been trying to work it out, but the error just won't go away.

import pandas as pd
from matplotlib import pyplot
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from pandas import read_csv
from pandas.plotting import scatter_matrix


filename = '/Users/rahulparmeshwar/Documents/Algo Bots/Data/Live Data/Tester.csv'
data = pd.read_csv(filename)
columnname = 'Scores'
bestfeatures = SelectKBest(k='all')
y = data['Vol']
X = data.drop('Open',axis=1)
fit = bestfeatures.fit(X,y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X.columns)
featurescores = pd.concat([dfscores,dfcolumns],axis=1)
print(featurescores.nlargest(5,[columnname]))

It fails on the last line, print(featurescores.nlargest(5,[columnname])), with the error 'Scores' followed by "The above exception was the direct cause of the following exception:". Can someone explain why this is happening? I've looked around and can't figure it out.

EDIT: Full Error Stack:

Exception has occurred: KeyError 'Scores'

The above exception was the direct cause of the following exception:

File "C:\Users\mattr\OneDrive\Documents\Python AI\AI.py", line 19, in <module> print(featurescores.nlargest(2,'Scores'))


Solution

  • The exception KeyError means that the concatenated dataframe featurescores does not have a column with name "Scores".
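
As for the message in the title: "The above exception was the direct cause of the following exception:" is not a separate error. It is the line Python prints between two chained exceptions, that is, when a new exception is raised with raise ... from inside an except block. Recent pandas versions appear to re-raise the low-level lookup failure this way, which is why the line shows up next to the KeyError. A minimal sketch that reproduces the same message, using a hypothetical lookup helper (an illustration only, not pandas code):

```python
import traceback


def lookup(columns, name):
    """Hypothetical helper: re-raise a failed lookup as a chained KeyError."""
    try:
        return columns[name]
    except KeyError as err:
        # "from err" stores err as __cause__, which is what produces the
        # "direct cause" line in the printed traceback.
        raise KeyError(name) from err


try:
    # The dict only has the key 0, so looking up "Scores" fails.
    lookup({0: "some column"}, "Scores")
except KeyError as exc:
    report = "".join(
        traceback.format_exception(type(exc), exc, exc.__traceback__)
    )
    print(report)
```

The real problem is always the KeyError itself; the "direct cause" line just separates the inner exception from the outer one.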

    The cause is that dfscores and dfcolumns are created without explicit column names, so the single column in each gets the default label 0 (the integer). After the concatenation, featurescores therefore has two columns that are both labeled 0, similar to this:

               0         0
    0         xxx     col_name1
    1         xxx     col_name2
    2         xxx     col_name3
    ...
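
You can check the default labels yourself without the CSV. The sketch below uses made-up numbers and feature names as stand-ins for fit.scores_ and X.columns (both are assumptions for illustration, not values from the question's data):

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for fit.scores_ and X.columns
scores = np.array([12.5, 3.1, 47.0])
feature_names = pd.Index(["High", "Low", "Close"])

# Same construction as in the question: no column names given
dfscores = pd.DataFrame(scores)
dfcolumns = pd.DataFrame(feature_names)
featurescores = pd.concat([dfscores, dfcolumns], axis=1)

# Both column labels are the integer 0, so "Scores" does not exist
print(list(featurescores.columns))

try:
    featurescores.nlargest(2, "Scores")
except KeyError as err:
    print("KeyError:", err)
```

Printing featurescores.columns is usually the quickest way to diagnose a KeyError like this one.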
    

    If you want to refer to the columns by name, you should define the column names explicitly as follows:

    >>> dfscores = pd.DataFrame(fit.scores_, columns=["Scores"])
    >>> dfcolumns = pd.DataFrame(X.columns, columns=["Features"])
    >>> featurescores = pd.concat([dfscores,dfcolumns], axis=1)
    >>> print(featurescores.nlargest(5, "Scores"))
    
           Scores   Features
    0       xxx       col_name1
    1       xxx       col_name2
    2       xxx       col_name3
    ...
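
An alternative that should work equally well is to leave the two DataFrames as they are and assign the column names once on the concatenated result. A small sketch, again with made-up scores and feature names standing in for fit.scores_ and X.columns:

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for fit.scores_ and X.columns
scores = np.array([12.5, 3.1, 47.0])
feature_names = ["High", "Low", "Close"]

featurescores = pd.concat(
    [pd.DataFrame(scores), pd.DataFrame(feature_names)], axis=1
)

# Replace the two default 0 labels with real names, in column order
featurescores.columns = ["Scores", "Features"]

# nlargest can now find the "Scores" column
top = featurescores.nlargest(2, "Scores")
print(top)
```

Naming the columns at construction time, as shown above, is arguably clearer, but this form is handy when the frame already exists.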
    

    If you want to use the features as the index, here is a one-liner (fit.scores_ and X.columns are already one-dimensional, so no transpose is needed):

    >>> featurescores = pd.DataFrame(data=fit.scores_, index=X.columns, columns=["Scores"])
    >>> print(featurescores)
    
                   Scores
    col_name1       xxx
    col_name2       xxx
    col_name3       xxx
    ...
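
With the features as the index, nlargest returns the feature names directly as row labels. A minimal sketch under the same assumption as before (made-up scores and names in place of fit.scores_ and X.columns):

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for fit.scores_ and X.columns
scores = np.array([12.5, 3.1, 47.0])
feature_names = pd.Index(["High", "Low", "Close"])

# Features become the row labels; the single column is named "Scores"
featurescores = pd.DataFrame(
    data=scores, index=feature_names, columns=["Scores"]
)

# The two highest-scoring features, highest first
top = featurescores.nlargest(2, "Scores")
print(top)
```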