I am trying to process my CSV with nlargest and I've run into this error. Any idea what could be causing it? I've been trying to get my head around it, but it just won't go away.
import pandas as pd
from matplotlib import pyplot
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from pandas import read_csv
from pandas.plotting import scatter_matrix
filename = '/Users/rahulparmeshwar/Documents/Algo Bots/Data/Live Data/Tester.csv'
data = pd.read_csv(filename)
columnname = 'Scores'
bestfeatures = SelectKBest(k='all')
y = data['Vol']
X = data.drop('Open',axis=1)
fit = bestfeatures.fit(X,y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X.columns)
featurescores = pd.concat([dfscores,dfcolumns],axis=1)
print(featurescores.nlargest(5,[columnname]))
It gives me the error Scores, followed by "the above exception was the direct cause of the following exception", on the last line: print(featurescores.nlargest(5,[columnname])). Can someone explain to me why this is happening? I've looked around and can't seem to figure this out.
EDIT: Full Error Stack:
Exception has occurred: KeyError 'Scores'
The above exception was the direct cause of the following exception:
File "C:\Users\mattr\OneDrive\Documents\Python AI\AI.py", line 19, in <module> print(featurescores.nlargest(2,'Scores'))
The exception KeyError means that the concatenated DataFrame featurescores does not have a column named "Scores".
The problem is that the DataFrames dfscores and dfcolumns are created without explicit column names, so the single column in each of them gets the default name 0.
That is, after the concatenation you get a DataFrame (featurescores) similar to this:
     0          0
0  xxx  col1_name
1  xxx  col2_name
2  xxx  col3_name
...
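To see this for yourself, here is a minimal sketch that reproduces the error without scikit-learn. The lists scores and names are made-up stand-ins for fit.scores_ and X.columns (they are not from the question's data), and the variable raised is only there to record whether the KeyError occurred.

```python
import pandas as pd

# Hypothetical stand-ins for fit.scores_ and X.columns (not the asker's data).
scores = [12.5, 3.1, 8.4]
names = ["High", "Low", "Close"]

# No column names are given, so each frame's single column is labelled 0.
dfscores = pd.DataFrame(scores)
dfcolumns = pd.DataFrame(names)
featurescores = pd.concat([dfscores, dfcolumns], axis=1)

# Both columns of the concatenated frame carry the default label 0.
column_labels = list(featurescores.columns)
print(column_labels)

# There is no column called "Scores", so nlargest raises KeyError.
try:
    featurescores.nlargest(2, "Scores")
    raised = False
except KeyError:
    raised = True
print(raised)
```

Printing featurescores.columns on your own DataFrame is a quick way to check which labels are actually available before calling nlargest.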
If you want to refer to the columns by name, you should define the column names explicitly as follows:
>>> dfscores = pd.DataFrame(fit.scores_, columns=["Scores"])
>>> dfcolumns = pd.DataFrame(X.columns, columns=["Features"])
>>> featurescores = pd.concat([dfscores,dfcolumns], axis=1)
>>> print(featurescores.nlargest(5, "Scores"))
  Scores   Features
0    xxx  col_name1
1    xxx  col_name2
2    xxx  col_name3
...
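As a rough, self-contained illustration of the named-column approach, the sketch below again uses made-up scores and feature names in place of fit.scores_ and X.columns. Note that nlargest returns the rows sorted by score in descending order, so the index of the result will generally not be 0, 1, 2 as in the placeholder output.

```python
import pandas as pd

# Hypothetical stand-ins for fit.scores_ and X.columns (not the asker's data).
scores = [12.5, 3.1, 8.4, 20.0]
names = ["High", "Low", "Close", "Adj"]

# Name the columns explicitly so they can be referred to by name later.
dfscores = pd.DataFrame(scores, columns=["Scores"])
dfcolumns = pd.DataFrame(names, columns=["Features"])
featurescores = pd.concat([dfscores, dfcolumns], axis=1)

# Rows with the two highest scores, highest first; original row labels are kept.
top2 = featurescores.nlargest(2, "Scores")
print(top2)
```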
If you want to use the features as the index, here is a one-liner:
>>> featurescores = pd.DataFrame(data=fit.scores_, index=X.columns, columns=["Scores"])
>>> print(featurescores)
          Scores
col_name1    xxx
col_name2    xxx
col_name3    xxx
...
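With the features as the index, nlargest still works on the "Scores" column and the feature names come back as the row labels. A small sketch, assuming the same kind of made-up stand-in values for fit.scores_ and X.columns:

```python
import pandas as pd

# Hypothetical stand-ins for fit.scores_ and X.columns (not the asker's data).
scores = [12.5, 3.1, 8.4, 20.0]
names = ["High", "Low", "Close", "Adj"]

# Feature names become the index; the scores get an explicit column name.
featurescores = pd.DataFrame(data=scores, index=names, columns=["Scores"])

# The two best-scoring features, identified by their index labels.
top2 = featurescores.nlargest(2, "Scores")
print(top2)
```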