python dataframe append statsmodels keyerror

A strange KeyError is preventing me from testing my logit regression classifier


I'm trying to run a logit regression from statsmodels in Python in a for loop. On each iteration, I append one row from the test data to my training DataFrame, rerun the regression, and store the results.

Now, the funny thing is that the test data is not getting appended correctly (which I think is causing the KeyError: 0 that I am getting, but I'd welcome your opinions). I've tried importing two versions of the test data: one with the same labels as the training data and another with no declared labels.

Here is my code:

import pandas as pd
import numpy as np
import statsmodels.api as sm
import datetime 

df_train = pd.read_csv('Adult-Incomes/train-labelled-final-variables-condensed-coded-countries-removed-unlabelled-income-to-the-left-relabelled-copy.csv')
print('Training set')
print(df_train.head(15))

train_cols = df_train.columns[1:]
logit = sm.Logit(df_train['Income'], df_train[train_cols])
result = logit.fit()

print("ODDS RATIO")
print(result.params)
print("RESULTS SUMMARY")
print(result.summary())
print("CONFIDENCE INTERVAL")
print(result.conf_int())

# append test data

print("PREDICTION PROCESS")
print("READING TEST DATA")
df_test = pd.read_csv('Adult-Incomes/test-final-variables-cleaned-coded-copy-relabelled.csv')
print("TEST DATA READ COMPLETE")

iteration_time = []
iteration_result = []
iteration_params = []
iteration_conf_int = []

df_train.to_pickle('train_iteration.pickle')
print(df_test.head())

print("Loop begins")

for row in range(0,len(df_test)):
    start_time = datetime.datetime.now()
    print("Loop iteration ", row, " in ", len(df_test), " rows")

    df_train = pd.read_pickle('train_iteration.pickle')
    print("pickle read")
    df_train.append(df_test[row])
    print("row ", row, " appended")
    train_cols = df_train.columns[1:]
    print("X variables extracted in new DataFrame")
    logit = sm.Logit(df_train['Income'], df_train[train_cols])
    print("Def logit reg eqn")
    result = logit.fit()
    print("fit logit reg eqn")
    iteration_result[row] = result.summary()
    print("logit result summary stored in array")
    iteration_params[row] = result.params
    print("logit params stored in array")
    iteration_conf_int[row] = result.conf_int()
    print("logit conf_int stored in array")

    df_train.to_pickle('train_iteration.pickle')
    print("exported to pickle")

    end_time = datetime.datetime.now()
    time_diff = start_time - end_time
    print("time for this iteration is ", time_diff)
    iteration_time[row] = time_diff
    print("ending iteration, starting next iteration of loop...")

print("Loop ends")

pd.DataFrame(iteration_result)
pd.DataFrame(iteration_time)
print (iteration_result.head())
print (iteration_time.head())

It prints up to here:

Loop iteration  0  in  15060  rows
pickle read

but then generates a KeyError: 0

What am I doing wrong here?

Version of the test data with labels matching the training data:

   Income  Age  Workclass  Education  Marital_Status  Occupation  \
0       0    1          4          7               4           6   
1       0    1          4          9               2           4   
2       1    1          6         12               2          10   
3       1    1          4         10               2           6   
4       0    1          4          6               4           7   

   Relationship  Race  Sex  Capital_gain  Capital_loss  Hours_per_week  
0             3     2    0             0             0              40  
1             0     4    0             0             0              50  
2             0     4    0             0             0              40  
3             0     2    0          7688             0              40  
4             1     4    0             0             0              30  

Version of the test data with no labels:

   0  1  4   7  4.1   6  3  2  0.1   0.2  0.3  40
0  0  1  4   9    2   4  0  4    0     0    0  50
1  1  1  6  12    2  10  0  4    0     0    0  40
2  1  1  4  10    2   6  0  2    0  7688    0  40
3  0  1  4   6    4   7  1  4    0     0    0  30
4  1  2  2  15    2   9  0  4    0  3103    0  32

Whether I use the labelled or the unlabelled version, I get the same error at the same point.

Can anyone guide me on how best to proceed?

UPDATE: here is the full error message (first three lines are print statements, error starts from fourth line):

Loop begins
Loop iteration  0  in  15060  rows
pickle read
Traceback (most recent call last):

  File "<ipython-input-10-1f56d5243e43>", line 1, in <module>
    runfile('/media/deepak/Laniakea/Projects/Training/SPYDER/classifier/classifier_test2.py', wdir='/media/deepak/Laniakea/Projects/Training/SPYDER/classifier')

  File "/usr/local/lib/python3.5/dist-packages/spyder/utils/site/sitecustomize.py", line 866, in runfile
    execfile(filename, namespace)

  File "/usr/local/lib/python3.5/dist-packages/spyder/utils/site/sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "/media/deepak/Laniakea/Projects/Training/SPYDER/classifier/classifier_test2.py", line 64, in <module>
    df_train.append(df_test[row])

  File "/usr/local/lib/python3.5/dist-packages/pandas/core/frame.py", line 2059, in __getitem__
    return self._getitem_column(key)

  File "/usr/local/lib/python3.5/dist-packages/pandas/core/frame.py", line 2066, in _getitem_column
    return self._get_item_cache(key)

  File "/usr/local/lib/python3.5/dist-packages/pandas/core/generic.py", line 1386, in _get_item_cache
    values = self._data.get(item)

  File "/usr/local/lib/python3.5/dist-packages/pandas/core/internals.py", line 3541, in get
    loc = self.items.get_loc(item)

  File "/usr/local/lib/python3.5/dist-packages/pandas/indexes/base.py", line 2136, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))

  File "pandas/index.pyx", line 139, in pandas.index.IndexEngine.get_loc (pandas/index.c:4443)

  File "pandas/index.pyx", line 161, in pandas.index.IndexEngine.get_loc (pandas/index.c:4289)

  File "pandas/src/hashtable_class_helper.pxi", line 732, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13733)

  File "pandas/src/hashtable_class_helper.pxi", line 740, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13687)

KeyError: 0

UPDATE: The last line printed by print(df_train.std()), after the standard deviations of all the columns, is dtype: float64. So I'm guessing my training data frame is being treated as float.
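
A sketch for checking that guess, on a made-up frame with integer-coded columns like the ones above. The point it illustrates: df.std() returns a Series of standard deviations, and the trailing dtype: float64 describes that Series, not the columns of the frame. df.dtypes reports the column types themselves.

```python
import pandas as pd

# Toy frame with integer-coded columns (values are made up for illustration).
df = pd.DataFrame({"Income": [0, 0, 1], "Age": [1, 1, 2]})

std = df.std()            # Series of per-column standard deviations
column_types = df.dtypes  # the actual dtype of each column

print(std.dtype)     # dtype of the std Series itself
print(column_types)  # dtypes of the columns in the frame
```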


Solution

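  • I think I've got it. df_test[row] looks up a column labelled 0, not the row at position 0, and that lookup is what raises KeyError: 0. Instead of the code below: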

    df_train.append(df_test[row])
    print("row ", row, " appended")
    

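    Rewrite it to:

    # append returns a new DataFrame, so assign the result back
    df_train = df_train.append(df_test.iloc[row])
    # drop=True discards the old index instead of adding it as a column
    df_train = df_train.reset_index(drop=True)
    print("row ", row, " appended")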
    

    Let me know if this serves the purpose. It's kind of essential to reset the index every time. Just one thing, though: if your test set is fairly large, this would be a computational disaster, since the model is retrained for every data point in the test set (15060 refits here).
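
To see the difference between df_test[row] and df_test.iloc[row], here is a minimal sketch on toy frames (the column values are made up). It uses pd.concat rather than DataFrame.append, because append was deprecated in pandas 1.4 and removed in 2.0; on the older pandas shown in the traceback, append is still available.

```python
import pandas as pd

# Toy stand-ins for the real training and test sets (values are made up).
df_train = pd.DataFrame({"Income": [0, 1], "Age": [1, 2]})
df_test = pd.DataFrame({"Income": [1, 0], "Age": [3, 4]})

# df_test[0] looks for a COLUMN labelled 0, which does not exist here.
try:
    df_test[0]
    raised = False
except KeyError:
    raised = True

# df_test.iloc[[0]] selects the ROW at position 0 as a one-row DataFrame.
row = df_test.iloc[[0]]

# concat returns a new frame; ignore_index=True renumbers the rows from 0.
df_train = pd.concat([df_train, row], ignore_index=True)
print(raised, len(df_train))
```

Passing ignore_index=True does the same job as the separate reset_index step, so the index stays contiguous after every append.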

    One piece of advice beyond the question: if you do want to retrain in near-real time, try appending the test set in batches or chunks instead of row by row.
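
The chunked approach could look roughly like the sketch below. The helper name refit_in_chunks, the chunk_size parameter and the fit callable are all assumptions made for illustration; fit stands in for whatever model-fitting step you use, and the toy "model" here is just the mean of Income so the control flow is easy to follow.

```python
import pandas as pd


def refit_in_chunks(df_train, df_test, chunk_size, fit):
    """Grow the training frame one chunk at a time, refitting after each chunk.

    `fit` is a hypothetical callable that takes the current training frame
    and returns a result; it is a placeholder for the model-fitting step.
    """
    results = []
    for start in range(0, len(df_test), chunk_size):
        chunk = df_test.iloc[start:start + chunk_size]
        df_train = pd.concat([df_train, chunk], ignore_index=True)
        # Use list.append: assigning results[i] on an empty list would fail.
        results.append(fit(df_train))
    return df_train, results


# Toy data and a stand-in "model" (values are made up for illustration).
train = pd.DataFrame({"Income": [0, 1], "Age": [1, 2]})
test = pd.DataFrame({"Income": [1, 0, 1, 1, 0], "Age": [3, 4, 5, 6, 7]})
final_train, results = refit_in_chunks(
    train, test, chunk_size=2, fit=lambda df: df["Income"].mean()
)
print(len(results), len(final_train))
```

With statsmodels, fit might be something like lambda df: sm.Logit(df['Income'], df[df.columns[1:]]).fit(disp=0). For the 15060 test rows in the question, a chunk size of 500 would mean 31 refits instead of 15060.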