I found the powerful RegexNER and it's superset TokensRegex from Stanford CoreNLP.
There are some rules that should give me fine results, like the pattern for PERSONs with titles:
"g. Meho Mehic" or "gdin. N. Neko" (g. and gdin. are abbrevs in Bosnian for mr.).
I'm having some trouble with existing tokenizer. It splits some strings on two tokens and some leaves as one, for example, token "g." is left as word <word>g.</word>
and token "gdin." is split on 2 tokens: <word>gdin</word>
and <word>.</word>
.
That causes trouble with my regex, I have to deal with one-token and multi-token cases (note the two "maybe-dot"s), RegexNER example:
( /g\.?|gdin\.?/ /\./? ([{ word:/[A-Z][a-z]*\.?/ }]+) ) PERSON
Also, this causes another issue, with sentence splitting, some sentences are not well recognized so regex fails... For example, when a sentence contains "gdin." it will split it on two, so a dot will end the (non-existing) sentence. I managed to bypass this with ssplit.isOneSentence = true
for now.
Questions:
Ok I thought about this for a bit and can actually think of something pretty straight forward for your case. One thing you could do is add "gdin" to the list of titles in the tokenizer.
The tokenizer rules are in edu.stanford.nlp.process.PTBLexer.flex (look at line 741)
I do not really understand the tokenizer that well, but clearly there are a list of job titles in there, so they must be cases where it will not split off the period.
This will of course require you to work with a custom build of Stanford CoreNLP.
You can get the full code at our GitHub:https://github.com/stanfordnlp/CoreNLP
There are instructions on the main page for building a jar with all of the main Stanford CoreNLP classes. I think if you just run the ant process it will automatically generate the new PTBLexer.java based on PTBLexer.flex.