Tags: python, scikit-learn, nltk, joblib, tfidfvectorizer

TfidfVectorizer model loaded from joblib file only works when trained in same session


sklearn's TfidfVectorizer only works when it is applied in the same session in which it was trained, in the case where the custom analyzer operates on nltk.tree.Tree objects. This is a mystery because the model is always loaded from a file before being applied. Debugging shows nothing wrong or different about the model file when it is loaded and applied at the start of its own session versus when it was trained in that same session. The analyzer is also applied and works correctly in both cases.

Below is a script to help reproduce the mysterious behavior:

import joblib
import numpy as np
from nltk import Tree
from sklearn.feature_extraction.text import TfidfVectorizer

def lexicalized_production_analyzer(sentence_trees):
    # Each "document" is a list of parse trees; the features are the
    # grammar productions extracted from every tree.
    productions_per_sentence = [tree.productions() for tree in sentence_trees]
    return np.concatenate(productions_per_sentence)

def train(corpus):
    # Fit the vectorizer and persist it to disk.
    model = TfidfVectorizer(analyzer=lexicalized_production_analyzer)
    model.fit(corpus)
    joblib.dump(model, "model.joblib")

def apply(corpus):
    # Load the persisted vectorizer and transform the corpus.
    model = joblib.load("model.joblib")
    result = model.transform(corpus)
    return result

# example data
trees = [Tree('ROOT', [Tree('FRAG', [Tree('S', [Tree('VP', [Tree('VBG', ['arkling']), Tree('NP', [Tree('NP', [Tree('NNS', ['dots'])]), Tree('VP', [Tree('VBG', ['nestling']), Tree('PP', [Tree('IN', ['in']), Tree('NP', [Tree('DT', ['the']), Tree('NN', ['grass'])])])])])])]), Tree(',', [',']), Tree('VP', [Tree('VBG', ['winking']), Tree('CC', ['and']), Tree('VP', [Tree('VBG', ['glimmering']), Tree('PP', [Tree('IN', ['like']), Tree('NP', [Tree('NNS', ['jewels'])])])])]), Tree('.', ['.'])])]),
 Tree('ROOT', [Tree('FRAG', [Tree('NP', [Tree('NP', [Tree('NNP', ['Rose']), Tree('NNS', ['petals'])]), Tree('NP', [Tree('NP', [Tree('ADVP', [Tree('RB', ['perhaps'])]), Tree(',', [',']), Tree('CC', ['or']), Tree('NP', [Tree('DT', ['some'])]), Tree('NML', [Tree('NN', ['kind'])])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('NN', ['confetti'])])])])]), Tree('.', ['.'])])])]
corpus = [trees, trees, trees]

First, train the model and save the model.joblib file.

train(corpus)
result = apply(corpus)
print("number of elements in results: " + str(result.getnnz()))
print("shape of results: " + str(result.shape))

We print the number of stored elements with .getnnz() to show that the model is working, with 120 non-zero elements counted:

number of elements in results: 120
shape of results: (3, 40)

Then restart Python and re-apply the model to the same corpus, without retraining.

result = apply(corpus)
print("number of elements in results: " + str(result.getnnz()))
print("shape of results: " + str(result.shape))

You will see that zero non-zero elements are stored this time.

number of elements in results: 0
shape of results: (3, 40)

But the model was loaded from a file both times, and there are no global variables (that I know of), so we can't think of any reason it works in one case and not the other.

Thanks for the help!


Solution

  • Ok, I did some very deep digging, and if you check the Production class here, which you are implicitly using through the Tree structure, it turns out that it stores a _hash attribute when the object is created. However, Python's hash function is nondeterministic between runs (string hashing is randomized per process by default), so this value will generally not be consistent across runs. The cached hash is then pickled along with the model by joblib rather than being recalculated as it should be, so this looks like a bug in nltk. As a result, the reloaded model never recognizes the production rules it was trained on: their hashes no longer match, so it is as if those production rules were never stored in the vocabulary (see the sketch after this answer).

    Quite tricky!

    Until this specific nltk bug is fixed, setting PYTHONHASHSEED before running both the training and the testing scripts will force the hash to be the same each time:

    PYTHONHASHSEED=0 python script.py
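
    To see the root cause in isolation, here is a minimal sketch. It assumes only that nltk's Production caches its hash in a private _hash attribute at construction time, as described above; run it in two separate interpreter sessions and the printed value will differ unless PYTHONHASHSEED is pinned.

    from nltk import Tree

    # Production.__hash__ returns a value cached when the object is built.
    # With hash randomization on (the default), that cached value changes
    # from run to run, so a Production unpickled by joblib no longer matches
    # an equal Production constructed fresh in the new session.
    tree = Tree.fromstring("(NP (DT the) (NN grass))")
    production = tree.productions()[0]
    print(hash(production))  # differs between runs unless PYTHONHASHSEED is set

    An alternative workaround, not part of the original answer, is to sidestep Production hashing entirely: have the analyzer return the string form of each production. Strings recompute their hash on demand instead of carrying a stale cached value through the pickle, so vocabulary lookups survive the joblib round-trip without setting PYTHONHASHSEED. A sketch of the modified analyzer:

    def lexicalized_production_analyzer(sentence_trees):
        # str(production) is a stable feature key across interpreter sessions.
        return [str(production)
                for tree in sentence_trees
                for production in tree.productions()]

    Note that the model must be retrained after changing the analyzer, since the previously saved vocabulary still holds Production objects.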