Tags: machine-learning, nlp, stanford-nlp, hpc

Do some HPC clusters cache only one result when running Stanford CoreNLP?


I am using the Stanford CoreNLP library in a Java project. I created a class called StanfordNLP, instantiated two objects, and initialized each one through its constructor with a different string. I am using the POS tagger to extract adjective-noun sequences. However, the program only shows the results from the first object: although every StanfordNLP object was initialized with a different string, each one returns the same results as the first. I'm new to Java, so I can't tell whether the problem is in my code or in the HPC cluster it runs on.

Instead of returning a list of strings from the StanfordNLP class method, I tried using a getter. I've also tried setting the first StanfordNLP object to null, so it no longer references anything, before creating the other objects. Nothing works.

/* in main */
List<String> pos_tokens0 = new ArrayList<String>();
List<String> pos_tokens1 = new ArrayList<String>();

String text0 = "Mary little lamb white fleece like snow";
StanfordNLP snlp0 = new StanfordNLP(text0);
pos_tokens0 = snlp0.process();

String text1 = "Everywhere little Mary went fluffy lamb ate green grass";
StanfordNLP snlp1 = new StanfordNLP(text1);
pos_tokens1 = snlp1.process();


/* in StanfordNLP.java */
import edu.stanford.nlp.ling.CoreAnnotations.PartOfSpeechAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

public class StanfordNLP {

    private static List<String> pos_adjnouns = new ArrayList<String>();
    private String documentText = "";

    public StanfordNLP() {}
    public StanfordNLP(String text) { this.documentText = text; }

    public List<String> process() {     
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, depparse");
        props.setProperty("coref.algorithm", "neural");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);    
        Annotation document = new Annotation(documentText);
        pipeline.annotate(document);

        List<CoreMap> sentences = document.get(SentencesAnnotation.class);
        List<String[]> corpus_temp = new ArrayList<String[]>();
        int count = 0;
    
        for(CoreMap sentence: sentences) {
            for (CoreLabel token: sentence.get(TokensAnnotation.class)) {
                String[] data = new String[2];
                String word = token.get(TextAnnotation.class);
                String pos = token.get(PartOfSpeechAnnotation.class);
                count ++;

                data[0] = word;
                data[1] = pos;         
                corpus_temp.add(data);
            }           
        }
    
        String[][] corpus = corpus_temp.toArray(new String[count][2]);
    
        // corpus contains string arrays with a word and its part-of-speech.
        for (int i=0; i<(corpus.length-3); i++) { 
            String word = corpus[i][0];
            String pos = corpus[i][1];
            String word2 = corpus[i+1][0];
            String pos2 = corpus[i+1][1];

            // find adjectives and nouns (eg, "fast car")
            if (pos.equals("JJ")) {         
                if (pos2.equals("NN") || pos2.equals("NNP") || pos2.equals("NNPS")) {
                    word = word + " " + word2;
                    pos_adjnouns.add(word);
                }
            }
        }
        return pos_adjnouns;
    }
}

Expected output for pos_tokens0 is "little lamb, white fleece". Expected output for pos_tokens1 is "little Mary, fluffy lamb, green grass". But the actual output for both variables is "little lamb, white fleece".

Any idea why this might be happening? I ran a simple Java jar with a main.java and myclass.java on an HPC server and couldn't replicate the problem, so it doesn't seem like the HPC server has trouble with multiple objects of the same class.


Solution

  • The problem looks like it is simply that your pos_adjnouns variable is static, and so is shared between all instances of StanfordNLP. Try removing the static keyword and see if things then work as you expect.

    Even that still isn't quite right, though: pos_adjnouns would be an instance variable, so repeated calls to process() on the same object would keep appending to the same list. Two other things you should do are (see the sketch after this list):

    1. Make pos_adjnouns a local variable inside the process() method, so each call starts with a fresh list.
    2. Conversely, initializing a StanfordCoreNLP pipeline is expensive, so move that out of the process() method. Swap the responsibilities: let the constructor build the pipeline once, and let process() take the String to analyze as a parameter.
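
    Putting those two changes together, here is a minimal sketch of how the class could look. It is a sketch, not a drop-in replacement: the annotator list is trimmed to tokenize, ssplit, pos on the assumption that only the POS tags are needed, the helper names (adjNounPairs, tagged) are just illustrative, and the bigram loop only looks one token ahead since that is all the adjective-noun check uses.

/* StanfordNLP.java, restructured */
import edu.stanford.nlp.ling.CoreAnnotations.PartOfSpeechAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

public class StanfordNLP {

    // built once in the constructor and reused for every call to process()
    private final StanfordCoreNLP pipeline;

    public StanfordNLP() {
        Properties props = new Properties();
        // assumption: only POS tags are needed, so lemma/ner/depparse are dropped
        props.setProperty("annotators", "tokenize, ssplit, pos");
        this.pipeline = new StanfordCoreNLP(props);
    }

    public List<String> process(String documentText) {
        // local to this call, so results from different texts never mix
        List<String> adjNounPairs = new ArrayList<String>();

        Annotation document = new Annotation(documentText);
        pipeline.annotate(document);

        // collect each token's text and part-of-speech tag, in order
        List<String[]> tagged = new ArrayList<String[]>();
        for (CoreMap sentence : document.get(SentencesAnnotation.class)) {
            for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
                tagged.add(new String[] { token.get(TextAnnotation.class),
                                          token.get(PartOfSpeechAnnotation.class) });
            }
        }

        // find adjective-noun bigrams (e.g. "fast car")
        for (int i = 0; i < tagged.size() - 1; i++) {
            String pos = tagged.get(i)[1];
            String pos2 = tagged.get(i + 1)[1];
            if (pos.equals("JJ")
                    && (pos2.equals("NN") || pos2.equals("NNP") || pos2.equals("NNPS"))) {
                adjNounPairs.add(tagged.get(i)[0] + " " + tagged.get(i + 1)[0]);
            }
        }
        return adjNounPairs;
    }
}

    The calling code then builds one StanfordNLP (and therefore one pipeline) and reuses it, and each call returns its own independent list:

/* in main */
StanfordNLP snlp = new StanfordNLP();
List<String> pos_tokens0 = snlp.process("Mary little lamb white fleece like snow");
List<String> pos_tokens1 = snlp.process("Everywhere little Mary went fluffy lamb ate green grass");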