I'm trying to run a logit regression from statsmodels in Python in a for loop: on each iteration I append one row from the test data to my training DataFrame, rerun the regression, and store the results.
Now, the funny thing is, the test data is not getting appended correctly (which I think is causing the KeyError: 0 I'm getting, but I'd welcome other opinions). I've tried importing two versions of the test data: one with the same column labels as the training data and another with no labels declared.
Here is my code:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import datetime
df_train = pd.read_csv('Adult-Incomes/train-labelled-final-variables-condensed-coded-countries-removed-unlabelled-income-to-the-left-relabelled-copy.csv')
print('Training set')
print(df_train.head(15))
train_cols = df_train.columns[1:]
logit = sm.Logit(df_train['Income'], df_train[train_cols])
result = logit.fit()
print("ODDS RATIO")
print(result.params)
print("RESULTS SUMMARY")
print(result.summary())
print("CONFIDENCE INTERVAL")
print(result.conf_int())
# append test data
print("PREDICTION PROCESS")
print("READING TEST DATA")
df_test = pd.read_csv('Adult-Incomes/test-final-variables-cleaned-coded-copy-relabelled.csv')
print("TEST DATA READ COMPLETE")
iteration_time = []
iteration_result = []
iteration_params = []
iteration_conf_int = []
df_train.to_pickle('train_iteration.pickle')
print(df_test.head())
print("Loop begins")
for row in range(0, len(df_test)):
    start_time = datetime.datetime.now()
    print("Loop iteration ", row, " in ", len(df_test), " rows")
    df_train = pd.read_pickle('train_iteration.pickle')
    print("pickle read")
    df_train.append(df_test[row])
    print("row ", row, " appended")
    train_cols = df_train.columns[1:]
    print("X variables extracted in new DataFrame")
    logit = sm.Logit(df_train['Income'], df_train[train_cols])
    print("Def logit reg eqn")
    result = logit.fit()
    print("fit logit reg eqn")
    iteration_result[row] = result.summary()
    print("logit result summary stored in array")
    iteration_params[row] = result.params
    print("logit params stored in array")
    iteration_conf_int[row] = result.conf_int()
    print("logit conf_int stored in array")
    df_train.to_pickle('train_iteration.pickle')
    print("exported to pickle")
    end_time = datetime.datetime.now()
    time_diff = start_time - end_time
    print("time for this iteration is ", time_diff)
    iteration_time[row] = time_diff
    print("ending iteration, starting next iteration of loop...")
print("Loop ends")
pd.DataFrame(iteration_result)
pd.DataFrame(iteration_time)
print (iteration_result.head())
print (iteration_time.head())
It prints up to here:
Loop iteration 0 in 15060 rows
pickle read
but then generates a KeyError: 0
What am I doing wrong here?
Version of the test data with labels matching the training data:
Income Age Workclass Education Marital_Status Occupation \
0 0 1 4 7 4 6
1 0 1 4 9 2 4
2 1 1 6 12 2 10
3 1 1 4 10 2 6
4 0 1 4 6 4 7
Relationship Race Sex Capital_gain Capital_loss Hours_per_week
0 3 2 0 0 0 40
1 0 4 0 0 0 50
2 0 4 0 0 0 40
3 0 2 0 7688 0 40
4 1 4 0 0 0 30
Version of the test data with no labels:
0 1 4 7 4.1 6 3 2 0.1 0.2 0.3 40
0 0 1 4 9 2 4 0 4 0 0 0 50
1 1 1 6 12 2 10 0 4 0 0 0 40
2 1 1 4 10 2 6 0 2 0 7688 0 40
3 0 1 4 6 4 7 1 4 0 0 0 30
4 1 2 2 15 2 9 0 4 0 3103 0 32
In both cases, whether I use the labelled or the unlabelled test data, I get the same error at the same point.
Can anyone guide me on how best to proceed?
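For reference, here is a minimal sketch (using a hypothetical two-column frame standing in for the real data) of why bracket indexing with an integer raises this error: `df[0]` is a column lookup, while positional row access needs `.iloc`:

```python
import pandas as pd

# Hypothetical two-column frame standing in for df_test.
df = pd.DataFrame({'Income': [0, 1], 'Age': [25, 40]})

try:
    df[0]            # bracket indexing looks for a COLUMN named 0
except KeyError as e:
    print('KeyError:', e)   # the same KeyError: 0 as in the loop

row = df.iloc[0]     # positional ROW access uses .iloc
print(row['Age'])    # 25
```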
UPDATE: here is the full error message (the first three lines are print statements; the error starts from the fourth line):
Loop begins
Loop iteration 0 in 15060 rows
pickle read
Traceback (most recent call last):
File "<ipython-input-10-1f56d5243e43>", line 1, in <module>
runfile('/media/deepak/Laniakea/Projects/Training/SPYDER/classifier/classifier_test2.py', wdir='/media/deepak/Laniakea/Projects/Training/SPYDER/classifier')
File "/usr/local/lib/python3.5/dist-packages/spyder/utils/site/sitecustomize.py", line 866, in runfile
execfile(filename, namespace)
File "/usr/local/lib/python3.5/dist-packages/spyder/utils/site/sitecustomize.py", line 102, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "/media/deepak/Laniakea/Projects/Training/SPYDER/classifier/classifier_test2.py", line 64, in <module>
df_train.append(df_test[row])
File "/usr/local/lib/python3.5/dist-packages/pandas/core/frame.py", line 2059, in __getitem__
return self._getitem_column(key)
File "/usr/local/lib/python3.5/dist-packages/pandas/core/frame.py", line 2066, in _getitem_column
return self._get_item_cache(key)
File "/usr/local/lib/python3.5/dist-packages/pandas/core/generic.py", line 1386, in _get_item_cache
values = self._data.get(item)
File "/usr/local/lib/python3.5/dist-packages/pandas/core/internals.py", line 3541, in get
loc = self.items.get_loc(item)
File "/usr/local/lib/python3.5/dist-packages/pandas/indexes/base.py", line 2136, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/index.pyx", line 139, in pandas.index.IndexEngine.get_loc (pandas/index.c:4443)
File "pandas/index.pyx", line 161, in pandas.index.IndexEngine.get_loc (pandas/index.c:4289)
File "pandas/src/hashtable_class_helper.pxi", line 732, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13733)
File "pandas/src/hashtable_class_helper.pxi", line 740, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13687)
KeyError: 0
UPDATE:
The last line of the print(df_train.std()) output, after the standard deviations of all the columns, is dtype: float64.
So I'm guessing my training DataFrame is being treated as float.
I think I got it... instead of the code below -
df_train.append(df_test[row])
print("row ", row, " appended")
Rewrite it to (note that append returns a new DataFrame, so the result must be assigned back) -
df_train = df_train.append(df_test.iloc[row])
df_train = df_train.reset_index(drop=True)
print("row ", row, " appended")
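One subtlety worth verifying with a small sketch (hypothetical frames; shown here with pd.concat, which replaced DataFrame.append in pandas 2.0): both return a new frame rather than modifying in place, so the result has to be assigned back.

```python
import pandas as pd

# Hypothetical stand-ins for the real train/test frames.
train = pd.DataFrame({'Income': [0], 'Age': [25]})
test = pd.DataFrame({'Income': [1], 'Age': [40]})

# concat returns a NEW frame; 'train' itself is left unchanged.
grown = pd.concat([train, test.iloc[[0]]], ignore_index=True)

print(len(train))   # still 1 -- nothing happened in place
print(len(grown))   # 2 -- the appended row lives in the reassigned result
```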
Let me know if this serves the purpose... it's kind of essential to reset the index every time. Just one thing though: if your test set is fairly large, this would be a computational disaster, retraining for every single data point seen in the test set...
Just a piece of advice outside this context: if you do want to train it in near real time, try using batches or chunks of the test set...
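The chunked idea can be sketched like this (hypothetical frames in place of the real CSVs; the sm.Logit refit is elided as a comment):

```python
import pandas as pd

# Hypothetical stand-ins for the real train/test CSVs.
df_train = pd.DataFrame({'Income': [0, 1] * 50, 'Age': range(100)})
df_test = pd.DataFrame({'Income': [1, 0] * 15, 'Age': range(100, 130)})

# Refit once per CHUNK of test rows instead of once per row.
chunk_size = 10
for start in range(0, len(df_test), chunk_size):
    chunk = df_test.iloc[start:start + chunk_size]
    df_train = pd.concat([df_train, chunk], ignore_index=True)
    # ... refit sm.Logit on the grown df_train here ...

print(len(df_train))  # 130 rows after 3 refits instead of 30
```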