CoreNLP TrueCaseAnnotator returns uppercased text in some cases

I often get fully uppercased results. In some cases the model works well, but in others it does much worse. Is there any way to fix this?

Some examples of bad cases:

World's Smallest Flower Vase! -> WORLD 'S SMALLEST FLOWER VASE !
Swarna Chaturvedy likes. Plants and few clicks away to win his Free terrace garden! -> SWARNA chaturvedy likes . Plants and few clicks away to WIN HIS FREE TERRACE GARDEN !
Thanos! Wins Infinity Gauntlet Fortnite: Battle Royale LIVE -> Thanos ! Wins Infinity Gauntlet FORTNITE : Battle Royale Live
DIY Static Orbit Sander With Hard Disk -> DIY STATIC ORBIT SANDER WITH HARD DISK
COOL CHRISTMAS CARDS -> COOL CHRISTMAS CARDS
This futuristic 3D printer uses light to print -> This futuristic 3D PRINTER USES LIGHT TO PRINT
Maia zooming for dinner -> MAIA ZOOMING FOR DINNER
Cosmetic surgeons use lasers to remove moles -> COSMETIC SURGEONS USE LASERS TO REMOVE MOLES

@anelkasam

I tried tuning the bias parameter, but the issue is still there.

Solution

Your best bet would be to train your own model. We may look into training a new model and distributing that at some point.

You can look over the properties file we used to train the model by extracting this file from the main models jar:

edu/stanford/nlp/models/truecase/truecasing.fast.caseless.prop

The training data is just space-separated tokens, one sentence per line, with the correct case. We can't distribute the training data we used for the model we distribute. Whatever your typical domain is, you can feed millions of sentences from it into the training process and train a new model, which may perform better on your dataset.
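As a rough illustration of that format, here is a minimal Python sketch that turns correctly cased sentences into one-sentence-per-line, space-separated tokens. The regex tokenizer and the function names are assumptions made for this example, not part of CoreNLP; for real training you would probably want to tokenize with the same tokenizer you use at prediction time so the tokens match.

```python
import re

# Hypothetical, simplified tokenizer (an assumption for this sketch):
# splits off clitics such as 's and separates punctuation from words.
# It does NOT reproduce CoreNLP's tokenizer exactly.
_TOKEN_RE = re.compile(r"'\w+|\w+|[^\w\s]")


def to_training_line(sentence):
    """Return the sentence as space-separated tokens, original case kept."""
    return " ".join(_TOKEN_RE.findall(sentence))


def write_training_file(sentences, path):
    """Write one tokenized sentence per line; return the number of lines written."""
    written = 0
    with open(path, "w", encoding="utf-8") as out:
        for sentence in sentences:
            line = to_training_line(sentence)
            if line:  # skip empty or whitespace-only sentences
                out.write(line + "\n")
                written += 1
    return written
```

For example, `to_training_line("World's Smallest Flower Vase!")` gives `World 's Smallest Flower Vase !`. The resulting file is what `trainFileList` would point to.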

The training data we used has 1,301,730 sentences.

There is a GitHub thread here about this: https://github.com/stanfordnlp/CoreNLP/issues/336

The training command should be:

java -Xmx100g edu.stanford.nlp.ie.crf.CRFClassifier -prop custom.prop

For reference this is what the extracted properties file looks like:

serializeTo=truecasing.fast.caseless.qn.ser.gz
trainFileList=/scr/nlp/data/gale/NIST09/truecaser/crf/noUN.input
testFile=/scr/nlp/data/gale/AE-MT-eval-data/mt06/cased/ref0

map=word=0,answer=1

wordFunction = edu.stanford.nlp.process.LowercaseFunction

useClassFeature=true
useWord=true
useNGrams=true
noMidNGrams=true
maxNGramLeng=6
usePrev=true
useNext=true
useLongSequences=true
useSequences=true
usePrevSequences=true
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
useOccurrencePatterns=true
useLastRealWord=true
useNextRealWord=true
useDisjunctive=true
disjunctionWidth=5
wordShape=chris2useLC
usePosition=true
useBeginSent=true
useTitle=true

useObservedSequencesOnly=true
saveFeatureIndexToDisk=true
normalize=true

useQN=false
QNSize=25

maxLeft=1
l1reg=1.0

readerAndWriter=edu.stanford.nlp.sequences.TrueCasingForNISTDocumentReaderAndWriter
featureFactory=edu.stanford.nlp.ie.NERFeatureFactory

featureDiffThresh=0.02
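
To adapt this for your own data, the lines you would most likely change in custom.prop are the three paths at the top. This is only a sketch: the file names and paths below are hypothetical placeholders, and the feature settings are left exactly as in the extracted file.

```
# custom.prop: only the lines changed from the extracted file
# (all paths and file names here are hypothetical placeholders)
serializeTo=truecasing.custom.ser.gz
trainFileList=/path/to/your/train.txt
testFile=/path/to/your/heldout.txt
```

After training, the annotator should be able to load the new model through the truecase.model property, the same way truecase.bias is set. Treat that as an assumption to verify against the CoreNLP version you are running.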