java, text-mining, training-data, opennlp

Add training data to existing model (bin file)


I'm trying to add extra training data to my nl-personTest.bin file with OpenNLP. My problem is that when I run the code below to add the extra training data, it removes the data that is already in the model and only adds my new data.

How can I add extra training data instead of replacing the existing data?

I used the following code (taken from Open NLP NER is not properly trained):

import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderFactory;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TrainNames {

    public static void main(String[] args) {
        train("nl", "person", "namen.txt", "nl-ner-personTest.bin");
    }

    public static String train(String lang, String entity, InputStreamFactory inputStream,
                               FileOutputStream modelStream) {

        Charset charset = Charset.forName("UTF-8");
        TokenNameFinderModel model;
        ObjectStream<NameSample> sampleStream = null;
        try {
            // Read the tagged corpus line by line and parse the name samples.
            ObjectStream<String> lineStream = new PlainTextByLineStream(inputStream, charset);
            sampleStream = new NameSampleDataStream(lineStream);
            TokenNameFinderFactory nameFinderFactory = new TokenNameFinderFactory();
            // NameFinderME.train builds a brand-new model from the given samples only.
            model = NameFinderME.train(lang, entity, sampleStream, TrainingParameters.defaultParams(),
                    nameFinderFactory);
        } catch (IOException io) {
            return "Something went wrong while reading the training data: " + io.getMessage();
        } finally {
            if (sampleStream != null) {
                try {
                    sampleStream.close();
                } catch (IOException io) {
                    // Ignore errors while closing the sample stream.
                }
            }
        }

        // Serialize the freshly trained model to the given output stream.
        BufferedOutputStream modelOut = null;
        try {
            modelOut = new BufferedOutputStream(modelStream);
            model.serialize(modelOut);
            return "Model trained and written successfully.";
        } catch (IOException io) {
            return "Something went wrong while writing the model: " + io.getMessage();
        } finally {
            if (modelOut != null) {
                try {
                    modelOut.close();
                } catch (IOException io) {
                    // Ignore errors while closing the model output stream.
                }
            }
        }
    }

    public static String train(String lang, String entity, String taggedCorpusFile,
                               String modelFile) {
        try {
            // Open the corpus passed in by the caller instead of a hardcoded file name.
            InputStreamFactory inputStream = new InputStreamFactory() {
                public InputStream createInputStream() throws IOException {
                    return new FileInputStream(taggedCorpusFile);
                }
            };

            return train(lang, entity, inputStream, new FileOutputStream(modelFile));
        } catch (Exception e) {
            e.printStackTrace();
        }
        return "Something went wrong with the training module.";
    }
}

Does anyone have any idea how to solve this problem?

According to the documentation, I need at least 15K sentences to get an accurate training set.


Solution

  • I don't think OpenNLP supports extending an existing binary model: NameFinderME.train always builds a brand-new model from the samples you pass in, so the previously used training data simply isn't part of the result.

    If you have all of the training data available, collect it and train on everything at once. You can use a SequenceInputStream for this. I modified your example to use a different InputStreamFactory:

    public String train(String lang, String entity, InputStreamFactory inputStream, FileOutputStream modelStream) {
    
        // ....
        try {
            ObjectStream<String> lineStream = new PlainTextByLineStream(trainingDataInputStreamFactory(Arrays.asList(
                    new File("trainingdata1.txt"),
                    new File("trainingdata2.txt"),
                    new File("trainingdata3.txt")
            )), charset);
    
            // ...
        } 
    
        // ...
    }
    
    private InputStreamFactory trainingDataInputStreamFactory(List<File> trainingFiles) {
        return new InputStreamFactory() {
            @Override
            public InputStream createInputStream() throws IOException {
                // Open every training file; files that cannot be found are skipped.
                List<InputStream> inputStreams = trainingFiles.stream()
                        .map(f -> {
                            try {
                                return new FileInputStream(f);
                            } catch (FileNotFoundException e) {
                                e.printStackTrace();
                                return null;
                            }
                        })
                        .filter(Objects::nonNull)
                        .collect(Collectors.toList());

                // SequenceInputStream chains the streams together, so OpenNLP
                // reads all training files as one continuous corpus.
                return new SequenceInputStream(new Vector<>(inputStreams).elements());
            }
        };
    }
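
    For completeness, here is a self-contained sketch of how the same SequenceInputStream approach can be wired into a single training run. The class name CombinedTraining and the second corpus file namen-extra.txt are made-up placeholders for illustration; the OpenNLP calls are the same ones already used in the question's code.

    import java.io.BufferedOutputStream;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.SequenceInputStream;
    import java.nio.charset.Charset;
    import java.util.Arrays;
    import java.util.List;
    import java.util.Vector;

    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.namefind.NameSample;
    import opennlp.tools.namefind.NameSampleDataStream;
    import opennlp.tools.namefind.TokenNameFinderFactory;
    import opennlp.tools.namefind.TokenNameFinderModel;
    import opennlp.tools.util.InputStreamFactory;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;
    import opennlp.tools.util.TrainingParameters;

    public class CombinedTraining {

        public static void main(String[] args) throws IOException {
            // Old corpus plus the newly collected sentences (namen-extra.txt is a made-up name).
            List<File> corpora = Arrays.asList(new File("namen.txt"), new File("namen-extra.txt"));

            InputStreamFactory combined = new InputStreamFactory() {
                public InputStream createInputStream() throws IOException {
                    Vector<InputStream> streams = new Vector<>();
                    for (File f : corpora) {
                        streams.add(new FileInputStream(f));
                    }
                    // Chain the files so they are read as one continuous corpus.
                    return new SequenceInputStream(streams.elements());
                }
            };

            // Train a single model over the combined corpus.
            ObjectStream<NameSample> samples = new NameSampleDataStream(
                    new PlainTextByLineStream(combined, Charset.forName("UTF-8")));
            TokenNameFinderModel model;
            try {
                model = NameFinderME.train("nl", "person", samples,
                        TrainingParameters.defaultParams(), new TokenNameFinderFactory());
            } finally {
                samples.close();
            }

            // Overwrite the model file with the model trained on everything.
            BufferedOutputStream modelOut = new BufferedOutputStream(
                    new FileOutputStream("nl-ner-personTest.bin"));
            try {
                model.serialize(modelOut);
            } finally {
                modelOut.close();
            }
        }
    }

    The effect is the same as "adding" data: the model is rebuilt from the complete corpus and the old .bin file is overwritten, since the binary model does not keep the original training sentences.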