I am trying to create a web application for predicting airline delays. I have trained my model offline on my computer, and am now building a Flask app to make predictions based on user input. For simplicity, let's say my model has 3 categorical variables: UNIQUE_CARRIER, ORIGIN and DEST. While training, I create dummy variables for all 3 using pandas:
df = pd.concat([df, pd.get_dummies(df['UNIQUE_CARRIER'], drop_first=True, prefix="UNIQUE_CARRIER")], axis=1)
df = pd.concat([df, pd.get_dummies(df['ORIGIN'], drop_first=True, prefix="ORIGIN")], axis=1)
df = pd.concat([df, pd.get_dummies(df['DEST'], drop_first=True, prefix="DEST")], axis=1)
df.drop(['UNIQUE_CARRIER', 'ORIGIN', 'DEST'], axis=1, inplace=True)
So now my feature vector has 297 entries (assuming there are 100 unique carriers and 100 different airports in my data, with one level dropped per variable by drop_first=True). I saved my model using pickle, and am now trying to predict based on user input. The user input comes in the form of 3 variables (origin, destination, carrier).
Obviously I cannot use pd.get_dummies on each user input, because there would be only 1 unique value in each of the three fields. What is the most efficient way to convert the user input into the feature vector for my model?
Since you are using pandas dummies, and hence dense vectors, a good way to build the new vector is to create a dict that maps each dummy column name to its index in the feature vector, and then fill in a vector of zeros accordingly, along these lines:
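# df is assumed to be the training frame after the get_dummies step, holding only the feature columns
index_dict = dict(zip(df.columns, range(df.shape[1])))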
Now, when you have a new flight:
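import numpy as np

new_vector = np.zeros(len(index_dict))
# get_dummies names its columns "<prefix>_<value>", e.g. "ORIGIN_JFK"
try:
    new_vector[index_dict["ORIGIN_" + origin]] = 1
except KeyError:
    pass  # dropped baseline level, or a value not seen in training
try:
    new_vector[index_dict["DEST_" + destination]] = 1
except KeyError:
    pass
try:
    new_vector[index_dict["UNIQUE_CARRIER_" + carrier]] = 1
except KeyError:
    pass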
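To show the whole round trip in one place, here is a minimal, self-contained sketch of the same idea. The toy training frame, the `CATEGORICAL` list and the `encode_flight` helper are made up for illustration and are not from the question. It also uses the `columns=` argument of `pd.get_dummies` for brevity, which should produce the same "<prefix>_<value>" column names as the three `pd.concat` calls. In a real app you would probably pickle `index_dict` next to the model and load both when Flask starts.

```python
import numpy as np
import pandas as pd

# Toy training frame standing in for the real data (values are made up).
train = pd.DataFrame({
    "UNIQUE_CARRIER": ["AA", "DL", "UA"],
    "ORIGIN": ["JFK", "LAX", "SFO"],
    "DEST": ["LAX", "SFO", "JFK"],
})
CATEGORICAL = ["UNIQUE_CARRIER", "ORIGIN", "DEST"]

# Same encoding as at training time; column names look like "ORIGIN_LAX".
features = pd.get_dummies(train, columns=CATEGORICAL, drop_first=True)

# Map each dummy column name to its position in the feature vector.
index_dict = {name: i for i, name in enumerate(features.columns)}


def encode_flight(carrier, origin, dest, index_dict):
    """Return a 1 x n_features array for a single flight (hypothetical helper)."""
    vector = np.zeros(len(index_dict))
    values = {"UNIQUE_CARRIER": carrier, "ORIGIN": origin, "DEST": dest}
    for prefix, value in values.items():
        position = index_dict.get(f"{prefix}_{value}")
        # None means the value was the dropped baseline level or was never
        # seen in training; either way its dummies stay at zero.
        if position is not None:
            vector[position] = 1
    # Most scikit-learn style models expect a 2-D array for predict().
    return vector.reshape(1, -1)


row = encode_flight("DL", "LAX", "JFK", index_dict)
```

Note that an unseen airport or carrier is encoded exactly like the dropped baseline level, so you may want to validate user input against the known values before predicting.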