I have data on various customer attributes (self-description and age) and a binary outcome of whether these customers would buy a specific product:
{"would_buy": "No",
"self_description": "I'm a college student studying biology",
"Age": 19},
I'd like to use MultinomialNB on self-description to predict would_buy, and then incorporate those predictions into a logistic regression model on would_buy that also takes age as a covariate.
Code for the text model so far (I am new to scikit-learn!) with a simplified dataset:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
# Customer data that includes whether a customer would buy an item (what I'm interested in), their self-description, and their age.
data = [
    {"would_buy": "No", "self_description": "I'm a college student studying biology", "Age": 19},
    {"would_buy": "Yes", "self_description": "I'm a blue-collar worker", "Age": 20},
    {"would_buy": "No", "self_description": "I'm a Stack Overflow denizen", "Age": 56},
    {"would_buy": "No", "self_description": "I'm a college student studying economics", "Age": 20},
    {"would_buy": "Yes", "self_description": "I'm a UPS worker", "Age": 35},
    {"would_buy": "No", "self_description": "I'm a Stack Overflow denizen", "Age": 56}
]
def naive_bayes_model(customer_data):
    self_descriptions = [customer['self_description'] for customer in customer_data]
    decisions = [customer['would_buy'] for customer in customer_data]
    vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1,2))
    X = vectorizer.fit_transform(self_descriptions, decisions)
    naive_bayes = MultinomialNB(alpha=0.01)
    naive_bayes.fit(X, decisions)
    train(naive_bayes, X, decisions)

def train(classifier, X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=22)
    classifier.fit(X_train, y_train)
    # classification_report expects the true labels first, then the predictions
    print(classification_report(y_test, classifier.predict(X_test)))

def main():
    naive_bayes_model(data)

main()
The short answer would be to use the predict_proba or predict_log_proba methods on your trained naive_bayes to create the inputs for your logistic regression model. These could be concatenated with the Age values to create the training and testing sets for your LogisticRegression model.
However, I do want to point out that the code as you have written it does not give you access to your naive_bayes model after it is trained: the model is a local variable inside naive_bayes_model and is never returned. So you definitely need to restructure your code.
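One possible restructuring, as a minimal sketch: have the function return both the fitted vectorizer and the fitted classifier so they can be reused later. The function name fit_naive_bayes is hypothetical (it is not in the question), and the data is a shortened copy of the question's toy records.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy records in the same shape as the question's data.
data = [
    {"would_buy": "No", "self_description": "I'm a college student studying biology", "Age": 19},
    {"would_buy": "Yes", "self_description": "I'm a blue-collar worker", "Age": 20},
    {"would_buy": "No", "self_description": "I'm a college student studying economics", "Age": 20},
    {"would_buy": "Yes", "self_description": "I'm a UPS worker", "Age": 35},
]


def fit_naive_bayes(customer_data):
    """Hypothetical helper: fit the text model and return it with its vectorizer."""
    descriptions = [c["self_description"] for c in customer_data]
    decisions = [c["would_buy"] for c in customer_data]
    vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
    X = vectorizer.fit_transform(descriptions)
    classifier = MultinomialNB(alpha=0.01)
    classifier.fit(X, decisions)
    # Returning both objects keeps them available to the caller.
    return vectorizer, classifier


vectorizer, naive_bayes = fit_naive_bayes(data)

# New text must go through the same fitted vectorizer before prediction.
# predict_proba returns one column per class, ordered as in naive_bayes.classes_.
probabilities = naive_bayes.predict_proba(vectorizer.transform(["I'm a UPS worker"]))
```

The vectorizer is returned along with the classifier because any text scored later has to be transformed with the vocabulary learned during fitting.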
That issue aside, this is how I would incorporate the output of naive_bayes into a LogisticRegression:
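import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# `data` is the list of customer dicts from the question.
descriptions = np.array([customer['self_description'] for customer in data])
decisions = np.array([customer['would_buy'] for customer in data])
ages = np.array([customer['Age'] for customer in data])
vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1,2))
desc_vec = vectorizer.fit_transform(descriptions, decisions)
naive_bayes = MultinomialNB(alpha=0.01)
# Split all three arrays in one call so the rows stay aligned.
desc_train, desc_test, age_train, age_test, dec_train, dec_test = train_test_split(desc_vec, ages, decisions, test_size=0.25, random_state=22)
naive_bayes.fit(desc_train, dec_train)
nb_train_preds = naive_bayes.predict_proba(desc_train)
lr = LogisticRegression()
# Stack the class probabilities and the age as columns of one feature matrix.
lr_X_train = np.hstack((nb_train_preds, age_train.reshape(-1, 1)))
lr.fit(lr_X_train, dec_train)
lr_X_test = np.hstack((naive_bayes.predict_proba(desc_test), age_test.reshape(-1, 1)))
lr.score(lr_X_test, dec_test)  # mean accuracy on the held-out rows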
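To score a new customer, the same two steps have to be chained: Naive Bayes probabilities from the description, then stacked with the age. Below is a rough sketch of that, assuming the same toy data; build_features is a hypothetical helper, the new customer is made up for illustration, and for brevity both models are fit on all rows with no train/test split.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

data = [
    {"would_buy": "No", "self_description": "I'm a college student studying biology", "Age": 19},
    {"would_buy": "Yes", "self_description": "I'm a blue-collar worker", "Age": 20},
    {"would_buy": "No", "self_description": "I'm a Stack Overflow denizen", "Age": 56},
    {"would_buy": "No", "self_description": "I'm a college student studying economics", "Age": 20},
    {"would_buy": "Yes", "self_description": "I'm a UPS worker", "Age": 35},
    {"would_buy": "No", "self_description": "I'm a Stack Overflow denizen", "Age": 56},
]

descriptions = [c["self_description"] for c in data]
decisions = np.array([c["would_buy"] for c in data])
ages = np.array([c["Age"] for c in data])

# Stage 1: text model on the self-descriptions.
vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
naive_bayes = MultinomialNB(alpha=0.01)
naive_bayes.fit(vectorizer.fit_transform(descriptions), decisions)


def build_features(texts, customer_ages):
    """Hypothetical helper: NB class probabilities plus age, one row per customer."""
    nb_probs = naive_bayes.predict_proba(vectorizer.transform(texts))
    age_column = np.asarray(customer_ages).reshape(-1, 1)
    return np.hstack((nb_probs, age_column))


# Stage 2: logistic regression on [P(No), P(Yes), Age].
lr = LogisticRegression()
lr.fit(build_features(descriptions, ages), decisions)

# A made-up new customer, run through the same two stages.
new_features = build_features(["I'm a graduate student studying chemistry"], [24])
prediction = lr.predict(new_features)
```

Routing both training and prediction through one helper keeps the column order identical in both places, which is what LogisticRegression relies on.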