Tags: java, neural-network, word2vec, deeplearning4j

How to clear the vocab cache in DeepLearning4j Word2Vec so it is retrained every time


Thanks in advance. I am using Word2Vec in DeepLearning4j.

How do I clear the vocab cache in Word2Vec? I want it to retrain on a new set of word patterns every time I reload Word2Vec. For now, it seems that the vocabulary from the previous set of word patterns persists, and I get the same result even though I changed my input training file.

I tried to reset the model, but it doesn't work. Code:

    Word2Vec vec = new Word2Vec.Builder()
            .minWordFrequency(1)
            .iterations(1)
            .layerSize(4)
            .seed(1)
            .windowSize(1)
            .iterate(iter)
            .tokenizerFactory(t)
            .resetModel(true)
            .limitVocabularySize(1)
            .build();

Can anyone help?


Solution

  • If you want to retrain from scratch (this is called training), I understand that you want to completely ignore the previously learned model (vocabulary, word vectors, ...). To do that, create a new Word2Vec object and fit it on the new data. You should also use new instances of the SentenceIterator and Tokenizer classes. Your problem could be in the way you change your input training files.

    It should be enough to change only the SentenceIterator, i.e.:

    SentenceIterator iter = new CollectionSentenceIterator(DataFetcher.getFirstDataset());
    Word2Vec vec = new Word2Vec.Builder()
                .iterate(iter)
                ....
                .build();
    
    vec.fit();
    
    vec.wordsNearest("clear", 10); // you will see results from first dataset
    
    SentenceIterator iter2 = new CollectionSentenceIterator(DataFetcher.getSecondDataset());
    vec = new Word2Vec.Builder()
        .iterate(iter2)
        ....
        .build();
    
    vec.fit();
    
    vec.wordsNearest("clear", 10); // you will see results from second dataset, without any first dataset implication
    

    If you run the code twice and change your input data between executions (say A, then B), you should not get the same results. If you do, it means your model learned the same thing from input data A and from input data B.

    If you want to update an existing model instead (this is continued training, sometimes called uptraining, rather than inference), that is, use the previously learned model together with new data to update it, then you should use this example from the dl4j examples.
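Since the answer suspects the way the input files are changed, one quick sanity check is to compare the vocabularies of the two inputs before training. The sketch below is not part of DL4J: the class `CorpusVocabDiff` and its methods are hypothetical helpers, and it assumes simple lower-cased whitespace tokenization, which may differ from what your TokenizerFactory produces.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper (not a DL4J class): compares the word sets of two corpora.
final class CorpusVocabDiff {

    /** Collects the distinct lower-cased, whitespace-separated tokens of a corpus. */
    static Set<String> vocabulary(List<String> sentences) {
        Set<String> vocab = new TreeSet<>();
        for (String sentence : sentences) {
            // Assumption: whitespace tokenization; a real tokenizer may split differently.
            for (String token : sentence.trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    vocab.add(token.toLowerCase(Locale.ROOT));
                }
            }
        }
        return vocab;
    }

    /** Returns the words that occur in {@code second} but not in {@code first}. */
    static Set<String> onlyInSecond(List<String> first, List<String> second) {
        Set<String> diff = vocabulary(second);
        diff.removeAll(vocabulary(first));
        return diff;
    }

    public static void main(String[] args) {
        // Made-up sentences standing in for the first and second training sets.
        List<String> datasetA = Arrays.asList("clear the cache", "train the model");
        List<String> datasetB = Arrays.asList("clear the table", "serve the dinner");

        System.out.println("Only in B: " + onlyInSecond(datasetA, datasetB));
        System.out.println("Only in A: " + onlyInSecond(datasetB, datasetA));
    }
}
```

If both difference sets come back empty, the two inputs contain the same words and similar results are less surprising. Otherwise, a model freshly trained on the second input would be expected to know the "only in B" words and not the "only in A" words, subject to settings such as minWordFrequency. You can probe this on the trained model, for example with `vec.hasWord(...)`.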