Search code examples
machine-learningnlpopennlppos-tagger

How to train Open NLP without file


i have the following code for training Open NLP POS Tagger

Trainer(String trainingData, String modelSavePath, String dictionary){

    try {
        dataIn = new MarkableFileInputStreamFactory(
                new File(trainingData));

        lineStream = new PlainTextByLineStream(dataIn, "UTF-8");
        ObjectStream<POSSample> sampleStream = new WordTagSampleStream(lineStream);

        POSTaggerFactory fac=new POSTaggerFactory();
        if(dictionary!=null && dictionary.length()>0)
        {
            fac.setDictionary(new Dictionary(new FileInputStream(dictionary)));
        }
        model = POSTaggerME.train("en", sampleStream, TrainingParameters.defaultParams(), fac);

    } catch (IOException e) {
        // Failed to read or parse training data, training failed
        e.printStackTrace();
    } finally {
        if (lineStream != null) {
            try {
                lineStream.close();
            } catch (IOException e) {
                // Not an issue, training already finished.
                // The exception should be logged and investigated
                // if part of a production system.
                e.printStackTrace();
            }
        }
    }
}

and this works just fine. Now, is it possible to do the same without involving files? I want to store the training data in a database somewhere. Then i can read it as a stream or chunks and feed it to the trainer. I do not want to create a temp file. Is this possible?


Solution

  • Yes, instead of passing FileInputStream to a dictionary, you can create your own implementation of InputStream, say DatabaseSourceInputStream and use it instead.