
How to predict data outside of the training data set


I'm using this model to predict country names from addresses:

import re
import numpy as np
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
def normalize_text(s):
    s = s.lower()
    s = re.sub(r'\s\W', ' ', s)
    s = re.sub(r'\W\s', ' ', s)
    s = re.sub(r'\s+', ' ', s)
    return s
df['TEXT'] = [normalize_text(s) for s in df['Full_Address']]

vectorizer = CountVectorizer()
x = vectorizer.fit_transform(df['TEXT'])

encoder = LabelEncoder()
y = encoder.fit_transform(df['CountryName'])

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

nb = MultinomialNB()
nb.fit(x_train, y_train)
y_predicted = nb.predict(x_test)
accuracy_score(y_test, y_predicted)

I want to use the model I built to predict the country for a single address string. How can I do this? I tried:

nb.predict('1100 112th Ave NE #400, Bellevue, WA 98004, United States')

ValueError: Expected 2D array, got scalar array instead:
array=1100 112th Ave NE #400, Bellevue, WA 98004, United States.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

UPDATE:

As suggested in an answer:

nb.predict([['1100 112th Ave NE #400, Bellevue, WA 98004, United States']])

ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 82043 is different from 1)

Solution

  • To predict, you need to pass your data through the same preprocessing steps you applied when training your model:

    single_address = '1100 112th Ave NE #400, Bellevue, WA 98004, United States'
    normalized_address = normalize_text(single_address)
    vectorized_address = vectorizer.transform([normalized_address])
    nb.predict(vectorized_address)  # returns the encoded country label
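    Since LabelEncoder was used on the targets, nb.predict returns an integer code; map it back to a country name with encoder.inverse_transform. A self-contained sketch (with hypothetical toy addresses standing in for the original df):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the original training data
addresses = ['bellevue wa 98004 united states',
             'rue de rivoli 75001 paris france',
             'seattle wa 98101 united states']
countries = ['US', 'France', 'US']

vectorizer = CountVectorizer()
x = vectorizer.fit_transform(addresses)

encoder = LabelEncoder()
y = encoder.fit_transform(countries)

nb = MultinomialNB().fit(x, y)

# predict returns the encoded label; inverse_transform recovers the name
pred = nb.predict(vectorizer.transform(['pike st seattle wa united states']))
print(encoder.inverse_transform(pred))  # -> ['US']
```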
    

    Two ways to improve your code:

    1. The normalize_text step is not really necessary as written: everything it does is already covered by CountVectorizer's defaults, token_pattern='(?u)\\b\\w\\w+\\b' and lowercase=True

    2. Keep all of the preprocessing in a sklearn Pipeline. This way your code will be cleaner and less error-prone (and you'll avoid errors like the one you hit)
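    To illustrate the first note, here is a quick check (on hypothetical toy strings) that CountVectorizer's defaults already lowercase the text and ignore stray punctuation, so both variants yield identical counts:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same address, once with mixed case and punctuation, once without
docs = ['1100 112th Ave NE #400, Bellevue',
        '1100 112TH AVE NE 400 BELLEVUE']

vec = CountVectorizer()  # defaults: lowercase=True, token_pattern='(?u)\\b\\w\\w+\\b'
X = vec.fit_transform(docs)

# Both rows have the same token counts, so extra cleanup adds nothing here
print((X.toarray()[0] == X.toarray()[1]).all())  # -> True
```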

    A working template showing how to achieve that:

    from sklearn.naive_bayes import MultinomialNB
    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.preprocessing import LabelEncoder
    from sklearn.pipeline import Pipeline
    
    X = 30*['1100 112th Ave NE #400, Bellevue, WA 98004, United States']
    y = 10*['US','France','Germany']
    
    le = LabelEncoder()
    y = le.fit_transform(y)
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    
    vectorizer = CountVectorizer()
    mnb = MultinomialNB()
    
    ppl = Pipeline(steps=[('vectorizer',vectorizer),('mnb',mnb)])
    
    ppl.fit(X_train, y_train)
    single_address = '1100 112th Ave NE #400, Bellevue, WA 98004, United States'
    ppl.predict([single_address])
    

    An extra benefit of having a Pipeline is that you can pass it to GridSearchCV, so the best parameters are chosen through cross-validation.
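    For instance, a minimal sketch (the parameter grid below is made up for illustration; note that Pipeline step names become parameter prefixes in the form '<step>__<param>'):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data, as in the template above
X = 30 * ['1100 112th Ave NE #400, Bellevue, WA 98004, United States']
y = 10 * [0, 1, 2]

ppl = Pipeline(steps=[('vectorizer', CountVectorizer()),
                      ('mnb', MultinomialNB())])

# Hypothetical grid: step name + '__' + parameter name
param_grid = {
    'vectorizer__ngram_range': [(1, 1), (1, 2)],
    'mnb__alpha': [0.5, 1.0],
}

gs = GridSearchCV(ppl, param_grid, cv=5)
gs.fit(X, y)
print(gs.best_params_)
```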