stanford-nlp stemming

Stanford CoreNLP Morphology.stemStatic disable lowercase conversion?


The comments on the stemStatic method of the Morphology class state that it will:

return a new WordTag which has the lemma as the value of word().
The default is to lowercase non-proper-nouns, unless options have been set.

(https://github.com/evandrix/stanford-corenlp/blob/master/src/edu/stanford/nlp/process/Morphology.java)

How/where can I set those options, to disable the lowercase conversion?

I've looked through the source but can't see how I can set options that will affect this static method. Frustratingly, the related static lemmatise method -- lemmaStatic -- includes a boolean parameter to do exactly this...

I'm using v3.3.1 via Maven...

thanks!


Solution

  • OK, after looking at this for a bit, it seems the right track is to skip the static method and instead build a Morphology instance with this constructor:

    public Morphology(Reader in, int flags)
    

    The int flags argument sets lexer.options.

    Here are the lexer options (from Morpha.java) :

    /** If this option is set, print the word affix after a + character */
    private final static int print_affixes = 0;
    /** If this option is set, lowercase all tokens */
    private final static int change_case = 1;
    /** Return the tags on the input words if present?? */
    private final static int tag_output = 2;
    

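    These constants are bit positions, and the int flags value is the bit string for the 3 options. So 7 = 111 sets all options to true, 0 = 000 sets all of them to false, and 5 = 101 sets print_affixes and tag_output but leaves change_case off. Any flags value with the change_case bit cleared (0 is the simplest) should therefore disable the lowercasing.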

    Then you can call apply on that instance (see Morphology.java):

    public Object apply(Object in)
    

    The in argument should be a WordTag built from the original word and its tag.

    Please let me know if you need any further assistance!

    We could also change Morphology.java to have the kind of method you want! The above is if you don't want to play around with customizing Stanford CoreNLP.
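
To make the flags arithmetic above concrete, here is a minimal standalone sketch in plain Java. It does not call CoreNLP at all. The class MorphaFlagsSketch and the helpers isSet and clear are hypothetical names made up for illustration, and it assumes the lexer tests each option as a bit position (1 << option), which is consistent with the 5 = 101 example above but is not copied from Morpha.java.

```java
// Standalone illustration of the flags bit string; not part of CoreNLP.
class MorphaFlagsSketch {
    // Bit positions, mirroring the constants quoted from Morpha.java.
    static final int PRINT_AFFIXES = 0;
    static final int CHANGE_CASE = 1;
    static final int TAG_OUTPUT = 2;

    /** Hypothetical helper: true if the option at this bit position is set. */
    static boolean isSet(int flags, int option) {
        return (flags & (1 << option)) != 0;
    }

    /** Hypothetical helper: returns flags with the given option cleared. */
    static int clear(int flags, int option) {
        return flags & ~(1 << option);
    }

    public static void main(String[] args) {
        int all = 7; // binary 111: every option on
        int noLowercase = clear(all, CHANGE_CASE);

        System.out.println(noLowercase);                         // 5
        System.out.println(Integer.toBinaryString(noLowercase)); // 101
        System.out.println(isSet(noLowercase, CHANGE_CASE));     // false
        System.out.println(isSet(noLowercase, TAG_OUTPUT));      // true
    }
}
```

Under that assumption, the value computed here (5), or simply 0, is what you would pass as the int flags argument of the Morphology constructor to keep the original case.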