nlp stanford-nlp data-analysis linguistics

NLP - linguistic consistency analysis

I hope you can help me :).

I am working for a translation company.

As you know, every translation consists of splitting the original text into small segments and then re-joining them into the final product.

In other words, the segments are considered as "translation units".

Often, especially in large documents, translators make linguistic consistency errors. Let me explain with an example.

In Spanish, you can address someone as "tú" or "usted", depending on the context, and this choice determines whether the tone of the sentence is informal or formal.

So, if you consider these two sentences of a document:

Lara, ¿te has lavado las manos? (TÚ)

Lara, ¿usted se lavó las manos? (USTED)

They are BOTH correct, but if you consider the whole document, there is a linguistic inconsistency.

I am studying the basics of NLP in my spare time, and I am trying to figure out how to create a tool that performs a linguistic consistency analysis on a set of sentences.

I am looking in particular at Stanford CoreNLP (I prefer Java to Python). I guess that, first of all, I need some linguistic tools to perform verb analysis. And naturally, the tool should be able to work with different languages (EN, IT, ES, FR, PT).

Can anyone help me figure out how to get started?

Any help would be appreciated,

thanks in advance!

Solution

I'm not sure about Stanford CoreNLP, but if you are open to it, you could build your own tagger that attaches modifiers to the POS tags, and then use those tags as a translation feature.

In other words, instead of just tagging a word as a verb, you could tag it as "a verb in the second person".

There are already good pre-tagged corpora for Spanish that can help you do exactly that. For example, if you look at the Universal Dependencies AnCora corpus, you will find annotations for the Person of each finite verb.
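To make this concrete, here is a minimal sketch of reading the Person feature out of CoNLL-U data, the 10-column tab-separated format that Universal Dependencies treebanks use (the morphological features sit in the sixth column, FEATS). The sample sentence is hand-written for illustration and is not copied from AnCora, so the exact feature values in the real corpus may differ; `parse_feats` and `read_tokens` are hypothetical helper names.

```python
def parse_feats(feats):
    """Turn a CoNLL-U FEATS string such as 'Mood=Ind|Person=2' into a dict."""
    if feats == "_":  # "_" means "no features" in CoNLL-U
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def read_tokens(conllu_text):
    """Yield (form, upos, feats_dict) for every ordinary token line."""
    for line in conllu_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword-token ranges ("1-2") and empty nodes ("1.1")
        yield cols[1], cols[3], parse_feats(cols[5])


# Hand-written sample (an assumption, not real corpus data): "te has lavado"
SAMPLE = "\n".join([
    "# text = te has lavado",
    "\t".join(["1", "te", "tú", "PRON", "_",
               "Number=Sing|Person=2|PronType=Prs", "3", "obj", "_", "_"]),
    "\t".join(["2", "has", "haber", "AUX", "_",
               "Mood=Ind|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin",
               "3", "aux", "_", "_"]),
    "\t".join(["3", "lavado", "lavar", "VERB", "_",
               "Gender=Masc|Number=Sing|VerbForm=Part", "0", "root", "_", "_"]),
])

# Keep only verbal tokens that actually carry a Person feature.
verb_persons = [
    (form, feats["Person"])
    for form, upos, feats in read_tokens(SAMPLE)
    if upos in {"VERB", "AUX"} and "Person" in feats
]
print(verb_persons)
```

Note that the participle "lavado" carries no Person feature in this sample, so only the finite auxiliary "has" is reported.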

With a little tweaking, you could build a composite POS tag such as "Verb-1st-Person" and train a tagger on it.
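A rough sketch of how such composite tags could be built before training, assuming each token is already available as a coarse POS tag plus a dict of morphological features. The function name `composite_tag`, the `VERB-2` naming scheme, and the toy token list are all illustrative choices, not part of any library.

```python
def composite_tag(upos, feats, keys=("Person",)):
    """Append selected feature values to the tag of verbal tokens.

    Non-verbal tokens keep their plain tag, and verbal tokens without the
    requested features (e.g. participles) are left unchanged as well.
    """
    if upos not in {"VERB", "AUX"}:
        return upos
    parts = [upos] + [feats[k] for k in keys if k in feats]
    return "-".join(parts)


# Toy tokens (hypothetical annotations): (form, coarse tag, features)
tokens = [
    ("te", "PRON", {"Person": "2"}),
    ("has", "AUX", {"Person": "2", "Number": "Sing"}),
    ("lavado", "VERB", {"VerbForm": "Part"}),
]

# (word, tag) pairs in the shape most tagger trainers expect as input.
training_pairs = [(form, composite_tag(upos, feats)) for form, upos, feats in tokens]
print(training_pairs)
```

Passing `keys=("Person", "Number")` would give finer tags such as `AUX-2-Sing`, at the cost of a larger tag set and sparser training data.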

I've written an article about how to do it in Python, but I bet that you can do it in Java using Weka. You can read the article here.

After this, I guess the next step is to check that the person used in one "translation unit" matches the person used in the others, or to set this up as a stage in a pipeline.
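That matching step could be sketched as a majority vote, assuming an earlier stage has already assigned each translation unit a register label (for example "tu" or "usted", or `None` when the segment does not address the reader at all). How to derive that label is language-specific and is not shown here: Spanish "usted", for instance, takes third-person verb forms, so the verb's Person alone is not enough. The function name and labels below are hypothetical.

```python
from collections import Counter


def find_inconsistent(labels):
    """Return (majority_label, outlier_segment_ids).

    `labels` maps a segment id to its register label, or to None when the
    segment carries no form of address. Segments labelled None are ignored.
    On a tie, the label that appears first in `labels` counts as the majority.
    """
    counts = Counter(label for label in labels.values() if label is not None)
    if not counts:
        return None, []
    majority = counts.most_common(1)[0][0]
    outliers = [
        seg for seg, label in labels.items()
        if label is not None and label != majority
    ]
    return majority, outliers


# Hypothetical labels for four translation units.
segment_labels = {1: "tu", 2: "tu", 3: "usted", 4: None}
majority, outliers = find_inconsistent(segment_labels)
print(majority, outliers)
```

A majority vote only flags candidates for review; a document may legitimately switch register (quoted dialogue, for example), so a human check on the flagged segments is still advisable.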