
CoreNLP TrueCaseAnnotator returns uppercased text in some cases


Quite often I get fully uppercased results. In some cases the model works well, but in others it performs poorly. Is there any way to fix this?

Some examples of bad cases:

  • World's Smallest Flower Vase! -> WORLD 'S SMALLEST FLOWER VASE !

  • Swarna Chaturvedy likes. Plants and few clicks away to win his Free terrace garden! -> SWARNA chaturvedy likes . Plants and few clicks away to WIN HIS FREE TERRACE GARDEN !

  • Thanos! Wins Infinity Gauntlet Fortnite: Battle Royale LIVE -> Thanos ! Wins Infinity Gauntlet FORTNITE : Battle Royale Live

  • DIY Static Orbit Sander With Hard Disk -> DIY STATIC ORBIT SANDER WITH HARD DISK

  • COOL CHRISTMAS CARDS -> COOL CHRISTMAS CARDS

  • This futuristic 3D printer uses light to print -> This futuristic 3D PRINTER USES LIGHT TO PRINT

  • Maia zooming for dinner -> MAIA ZOOMING FOR DINNER

  • Cosmetic surgeons use lasers to remove moles -> COSMETIC SURGEONS USE LASERS TO REMOVE MOLES

    @anelkasam

I tried tuning the bias parameter, but the issue is still there.


Solution

  • Your best bet would be to train your own model. We may look into training a new model and distributing that at some point.

    You can look over the properties file we used to train the model by extracting this file from the main models jar:

    edu/stanford/nlp/models/truecase/truecasing.fast.caseless.prop
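
Since a jar is an ordinary zip archive, the extraction step can be sketched with Python's standard library. This is only a minimal sketch: `extract_member` is a hypothetical helper name, and the jar filename you pass in depends on the CoreNLP release you downloaded.

```python
import zipfile
from pathlib import Path

# Path of the properties file inside the models jar, as given in the answer.
PROPS_MEMBER = "edu/stanford/nlp/models/truecase/truecasing.fast.caseless.prop"


def extract_member(jar_path, member=PROPS_MEMBER, out_dir="."):
    """Copy one entry out of a jar (a zip archive) and return the written path."""
    out_path = Path(out_dir) / Path(member).name
    with zipfile.ZipFile(jar_path) as jar:
        # read() raises KeyError if the entry is not present in the archive.
        out_path.write_bytes(jar.read(member))
    return out_path
```

Calling `extract_member("stanford-corenlp-models.jar")` (a placeholder filename) would write `truecasing.fast.caseless.prop` to the current directory.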
    

    The training data is just space-separated tokens, one sentence per line, with the correct case. We can't distribute the training data we used for the model we ship. Whatever your typical domain is, you can feed millions of sentences from it into the training process and train a new model, which may perform better on your dataset.
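
To illustrate the format, here is a rough sketch that turns correctly cased raw text into one sentence per line with space-separated tokens. The regex-based sentence splitter and tokenizer are assumptions made for brevity, not what CoreNLP uses; for real training data you would probably want tokens produced by the same tokenizer that runs before the truecaser. `to_training_lines` is a hypothetical helper name.

```python
import re

# Crude tokenizer (an assumption): runs of word characters, or one punctuation mark.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")
# Crude sentence boundary (an assumption): whitespace that follows ., ! or ?
SENT_RE = re.compile(r"(?<=[.!?])\s+")


def to_training_lines(text):
    """Return one string per sentence, tokens joined by single spaces, case untouched."""
    lines = []
    for sentence in SENT_RE.split(text.strip()):
        tokens = TOKEN_RE.findall(sentence)
        if tokens:  # skip empty fragments
            lines.append(" ".join(tokens))
    return lines
```

For instance, `to_training_lines("Maia is zooming for dinner. Plants need light!")` yields the two lines `Maia is zooming for dinner .` and `Plants need light !`. Note that this splitter breaks contractions such as `World's` differently from the `World 's` tokenization seen in the question's output.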

    The training data we used has 1,301,730 sentences.

    There is a GitHub thread here about this: https://github.com/stanfordnlp/CoreNLP/issues/336

    The training command should be:

    java -Xmx100g edu.stanford.nlp.ie.crf.CRFClassifier -prop custom.prop
    

    For reference, this is what the extracted properties file looks like:

    serializeTo=truecasing.fast.caseless.qn.ser.gz
    trainFileList=/scr/nlp/data/gale/NIST09/truecaser/crf/noUN.input
    testFile=/scr/nlp/data/gale/AE-MT-eval-data/mt06/cased/ref0
    
    map=word=0,answer=1
    
    wordFunction = edu.stanford.nlp.process.LowercaseFunction
    
    useClassFeature=true
    useWord=true
    useNGrams=true
    noMidNGrams=true
    maxNGramLeng=6
    usePrev=true
    useNext=true
    useLongSequences=true
    useSequences=true
    usePrevSequences=true
    useTypeSeqs=true
    useTypeSeqs2=true
    useTypeySequences=true
    useOccurrencePatterns=true
    useLastRealWord=true
    useNextRealWord=true
    useDisjunctive=true
    disjunctionWidth=5
    wordShape=chris2useLC
    usePosition=true
    useBeginSent=true
    useTitle=true
    
    useObservedSequencesOnly=true
    saveFeatureIndexToDisk=true
    normalize=true
    
    useQN=false
    QNSize=25
    
    maxLeft=1
    l1reg=1.0
    
    readerAndWriter=edu.stanford.nlp.sequences.TrueCasingForNISTDocumentReaderAndWriter
    featureFactory=edu.stanford.nlp.ie.NERFeatureFactory
    
    featureDiffThresh=0.02
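
The `serializeTo`, `trainFileList` and `testFile` values above are paths from the original training setup, so a `custom.prop` for your own data would presumably need at least those entries changed. A minimal sketch of that rewrite follows; `override_props` is a hypothetical helper, and the file names in the usage note are placeholders.

```python
import re


def override_props(props_text, overrides):
    """Return props_text with the given keys replaced; unknown keys are appended."""
    remaining = dict(overrides)
    out_lines = []
    for line in props_text.splitlines():
        # Match "key=value" or "key = value"; comment and blank lines do not match.
        match = re.match(r"\s*([^=\s#]+)\s*=", line)
        if match and match.group(1) in remaining:
            key = match.group(1)
            out_lines.append(f"{key}={remaining.pop(key)}")
        else:
            out_lines.append(line)
    # Keys that were not present in the original file go at the end.
    out_lines.extend(f"{key}={value}" for key, value in remaining.items())
    return "\n".join(out_lines) + "\n"
```

A typical use would be to read the extracted properties file, call `override_props(text, {"trainFileList": "my_train.txt", "serializeTo": "my_truecaser.ser.gz"})`, write the result to `custom.prop`, and pass that file to the training command given above.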