For some reason I am using the WEKA API...
I have generated tf-idf scores for a set of documents,
StringToWordVector filter = new StringToWordVector();
filter.setInputFormat(data);
filter.setIDFTransform(true);
filter.setStopwordsHandler(new StopWordsHandlerEN()); // just a simple handler for stop words I created
filter.setLowerCaseTokens(true);
filter.setStemmer(new MyStemmer()); // a stemmer I created
filter.setWordsToKeep(words2keep);
Instances result = Filter.useFilter(data, filter);
then split them into train and test subsets, did training, testing and all that...
Once I had a trained, ready-to-go model for classification, I wanted to create a plain API that would classify any incoming document. But the thing is, the new tf-idf scores need to be calculated based on the tf-idf vector and the words of the original set of documents, right? In other words, if I am not mistaken, I need to load a counterpart of scikit-learn's TfidfVectorizer.
I cannot find anything like it in WEKA... Is there such a thing?
The StringToWordVector filter uses the weka.core.DictionaryBuilder class under the hood for the TF-IDF computation. As long as you create a weka.core.Instance object with the text that you want converted, you can do that using the builder's vectorizeInstance(Instance) method.
Edit 1:
Below is an example based on your code (but with standard Weka classes) that shows how to use either the filter or the DictionaryBuilder for the TF-IDF transformation. Both get serialized, deserialized and re-used as well, to demonstrate that these classes are serializable:
import weka.core.DictionaryBuilder;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.SerializationHelper;
import weka.core.converters.ConverterUtils;
import weka.core.stemmers.LovinsStemmer;
import weka.core.stopwords.Rainbow;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;
public class TFIDF {

  // just exposes the internal DictionaryBuilder member
  public static class StringToWordVectorExposed
    extends StringToWordVector {

    public DictionaryBuilder getDictionary() {
      return m_dictionaryBuilder;
    }
  }

  public static void main(String[] args) throws Exception {
    // load data
    Instances train = ConverterUtils.DataSource.read("/some/where/train.arff");
    train.setClassIndex(train.numAttributes() - 1);
    Instances test = ConverterUtils.DataSource.read("/some/where/test.arff");
    test.setClassIndex(test.numAttributes() - 1);

    // configure the filter (set all options before calling setInputFormat)
    StringToWordVectorExposed filter = new StringToWordVectorExposed();
    int words2keep = 100;
    filter.setIDFTransform(true);
    filter.setStopwordsHandler(new Rainbow());
    filter.setLowerCaseTokens(true);
    filter.setStemmer(new LovinsStemmer());
    filter.setWordsToKeep(words2keep);
    filter.setInputFormat(train);
    Instances trainFiltered = Filter.useFilter(train, filter);
    DictionaryBuilder builder = filter.getDictionary();

    // apply filter/dictionary
    Instances testFiltered = Filter.useFilter(test, filter);
    System.out.println(testFiltered.instance(0));
    Instance tfidf = builder.vectorizeInstance(test.instance(0));
    System.out.println(tfidf);

    // serialize
    SerializationHelper.write("/some/where/filter.ser", filter);
    SerializationHelper.write("/some/where/dictionary.ser", filter.getDictionary());

    // deserialize
    StringToWordVectorExposed filter2 = (StringToWordVectorExposed) SerializationHelper.read("/some/where/filter.ser");
    DictionaryBuilder builder2 = (DictionaryBuilder) SerializationHelper.read("/some/where/dictionary.ser");

    // re-apply
    testFiltered = Filter.useFilter(test, filter2);
    System.out.println(testFiltered.instance(0));
    tfidf = builder2.vectorizeInstance(test.instance(0));
    System.out.println(tfidf);
  }
}
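To connect this back to your goal of classifying an arbitrary incoming document: once the filter has been deserialized, you can wrap the new text in a dataset that has the same raw structure as your training data and push it through the saved filter. Below is a minimal sketch of that idea; the attribute names and the "pos"/"neg" class labels are placeholders, and the structure you build must match the raw (unfiltered) format the filter was originally trained on:

```java
import java.util.ArrayList;
import java.util.List;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;
import weka.core.SerializationHelper;
import weka.filters.Filter;

public class ClassifyIncoming {
  public static void main(String[] args) throws Exception {
    // rebuild the raw structure: one string attribute plus the class attribute
    ArrayList<Attribute> atts = new ArrayList<>();
    atts.add(new Attribute("text", (List<String>) null)); // null values -> string attribute
    ArrayList<String> classValues = new ArrayList<>();
    classValues.add("pos"); // placeholder labels; use your model's actual classes
    classValues.add("neg");
    atts.add(new Attribute("class", classValues));
    Instances structure = new Instances("incoming", atts, 0);
    structure.setClassIndex(structure.numAttributes() - 1);

    // wrap the incoming document in an Instance (class value left missing)
    DenseInstance doc = new DenseInstance(structure.numAttributes());
    doc.setDataset(structure);
    doc.setValue(0, "the text of the incoming document");
    doc.setMissing(structure.classIndex());
    structure.add(doc);

    // deserialize the trained filter and push the new document through it;
    // since its first batch is already done, the stored dictionary is re-used
    Filter filter = (Filter) SerializationHelper.read("/some/where/filter.ser");
    Instances vectorized = Filter.useFilter(structure, filter);
    System.out.println(vectorized.instance(0));
  }
}
```

The vectorized instance can then be handed to your trained classifier's classifyInstance/distributionForInstance methods.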