
How to create dummy variables for prediction from user input (only one record)?


I am trying to create a web application for predicting airline delays. I have trained my model offline on my computer, and am now trying to build a Flask app that makes predictions based on user input. For simplicity, let's say my model has 3 categorical variables: UNIQUE_CARRIER, ORIGIN and DEST. During training, I create dummy variables for all 3 using pandas:

df = pd.concat([df, pd.get_dummies(df['UNIQUE_CARRIER'], drop_first=True, prefix="UNIQUE_CARRIER")], axis=1)
df = pd.concat([df, pd.get_dummies(df['ORIGIN'], drop_first=True, prefix="ORIGIN")], axis=1)
df = pd.concat([df, pd.get_dummies(df['DEST'], drop_first=True, prefix="DEST")], axis=1)
df.drop(['UNIQUE_CARRIER', 'ORIGIN', 'DEST'], axis=1, inplace=True)

So now my feature vector has 297 elements (assuming there are 100 unique carriers and 100 different airports in my data, drop_first=True leaves 99 dummy columns per variable). I saved my model using pickle, and am now trying to predict based on user input. The user input comes as 3 variables (origin, destination, carrier).

Obviously I cannot call pd.get_dummies on each user input, because each of the three fields would contain only 1 unique value. What is the most efficient way to convert the user input into the feature vector for my model?
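To illustrate the problem, here is a minimal sketch. The single-row frame and the value "JFK" are made up for illustration. With one record, pd.get_dummies only sees one category, so the result cannot line up with the training columns, and with drop_first=True even that one category is dropped:

```python
import pandas as pd

# A single user record (hypothetical value, for illustration only)
one_record = pd.DataFrame({"ORIGIN": ["JFK"]})

# get_dummies only knows about the one category present in this row
plain = pd.get_dummies(one_record["ORIGIN"], prefix="ORIGIN")
print(plain.columns.tolist())

# With drop_first=True, that only category is dropped as well
single = pd.get_dummies(one_record["ORIGIN"], drop_first=True, prefix="ORIGIN")
print(single.shape)
```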


Solution

  • Since you are using pandas dummies, and hence dense vectors, a good way to build a new vector is to create a dict that maps each dummy column name to its index in the vector, and then fill a vector of zeros from it, something along these lines:

    # df is the training frame after encoding; it should contain only the feature columns
    index_dict = dict(zip(df.columns, range(df.shape[1])))
    

    Now, when you have a new flight:

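    import numpy as np

    new_vector = np.zeros(len(index_dict))
    # Dummy columns are named "<prefix>_<value>", e.g. "ORIGIN_JFK"
    user_input = [("UNIQUE_CARRIER", carrier), ("ORIGIN", origin), ("DEST", destination)]
    for prefix, value in user_input:
        idx = index_dict.get(f"{prefix}_{value}")
        # No match means the value is the dropped first category
        # or was never seen in training; leave those positions at 0
        if idx is not None:
            new_vector[idx] = 1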
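An alternative that stays within pandas is to save the list of training columns alongside the pickled model and reindex the encoded user record against it. The following is only a sketch under assumptions: the toy training data, the `encode_input` helper and the `feature_columns` name are made up for illustration, and `pd.get_dummies(..., columns=...)` is used here in place of the three `pd.concat` calls from the question.

```python
import pandas as pd

# Toy training data (hypothetical); the real data has many more categories
train = pd.DataFrame({
    "UNIQUE_CARRIER": ["AA", "DL", "UA"],
    "ORIGIN": ["JFK", "LAX", "SFO"],
    "DEST": ["LAX", "SFO", "JFK"],
})
cat_cols = ["UNIQUE_CARRIER", "ORIGIN", "DEST"]

# Encode as in training and remember the resulting column order
X_train = pd.get_dummies(train, columns=cat_cols, drop_first=True)
feature_columns = list(X_train.columns)  # store this next to the pickled model


def encode_input(carrier, origin, destination, feature_columns):
    """Encode one user record into the training feature layout."""
    row = pd.DataFrame(
        [{"UNIQUE_CARRIER": carrier, "ORIGIN": origin, "DEST": destination}]
    )
    # No drop_first here: reindex keeps only the columns seen in training
    dummies = pd.get_dummies(row, columns=cat_cols)
    return dummies.reindex(columns=feature_columns, fill_value=0).astype(int)


x_new = encode_input("DL", "JFK", "SFO", feature_columns)
print(x_new)
```

Note that a value never seen in training ends up as all zeros for that variable, which makes it indistinguishable from the dropped first category. Whether that is acceptable depends on the model.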