I need to find a method to classify a large number of text samples (comments) written in Portuguese into two groups: Brazilian Portuguese and European Portuguese.
Googling didn't help much; maybe I'm missing something? Please help, thanks a lot!
It would be great if the solution were accessible from Python.
One of my ideas was to find the key grammatical differences between the two Portuguese variants, but as far as I can see that won't get me far. I also looked in the direction of ML models capable of doing this. To be honest, I have no working ideas at all, which is what made me ask my first question ever on this platform.
I assume you have many documents which bear an is_brazilian label.
Compute tf/idf statistics.
Look for unigrams or bigrams that are moderately common and whose appearance strongly predicts one or the other label.
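A minimal, stdlib-only sketch of this idea. A smoothed log-odds score stands in for full tf-idf weighting, and the two toy corpora plus the marker words are illustrative assumptions, not real data:

```python
import math
from collections import Counter

def discriminative_terms(docs_br, docs_eu, min_count=2):
    """Rank unigrams by how strongly they predict the Brazilian label.

    Uses add-one smoothed log-odds of a term appearing in a Brazilian
    vs. European document -- a lightweight stand-in for tf-idf scoring.
    Positive scores lean Brazilian, negative scores lean European.
    """
    # Count, per class, how many documents contain each unigram.
    br = Counter(w for d in docs_br for w in set(d.lower().split()))
    eu = Counter(w for d in docs_eu for w in set(d.lower().split()))
    # Keep only moderately common terms, as suggested above.
    vocab = [w for w in (br | eu) if br[w] + eu[w] >= min_count]
    n_br, n_eu = len(docs_br), len(docs_eu)

    def log_odds(w):
        p_br = (br[w] + 1) / (n_br + 2)  # add-one smoothing
        p_eu = (eu[w] + 1) / (n_eu + 2)
        return math.log(p_br / p_eu)

    return sorted(vocab, key=log_odds, reverse=True)

# Toy corpora built around the 'ônibus' / 'autocarro' contrast:
docs_br = ["o ônibus chegou atrasado", "peguei o ônibus cedo",
           "a recepção foi ótima"]
docs_eu = ["o autocarro chegou atrasado", "apanhei o autocarro cedo",
           "a receção foi ótima"]
ranked = discriminative_terms(docs_br, docs_eu)
print(ranked[0])   # most Brazilian-leaning term
print(ranked[-1])  # most European-leaning term
```

On a real corpus you would do the same with a proper tf-idf vectorizer and thousands of documents; the ranking logic is unchanged.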
Balanced collection policies are important here.
If, for example, most of the collected Brazilian documents related to sport while the European documents related to finance, then such a term-frequency approach would really be predicting "sport vs finance" instead of language locale.
A part-of-speech tagger can help with this, so you focus on common function words rather than on, e.g., proper nouns that are specific to one geography or the other.
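As a rough illustration of that filtering: here a hand-picked function-word list stands in for a real tagger (in practice you would derive the list from POS-tagged output, e.g. spaCy's Portuguese pipeline; the list and the sample sentence below are only illustrative):

```python
from collections import Counter

# Crude substitute for a POS tagger: a small, hand-picked set of
# common Portuguese function words and clitic/pronoun forms.
FUNCTION_WORDS = {
    "o", "a", "os", "as", "um", "uma", "de", "em", "que",
    "você", "vocês", "tu", "te", "lhe", "se", "não", "para",
}

def function_word_profile(doc):
    """Return the relative frequency of each function word in a comment,
    ignoring content words entirely."""
    tokens = doc.lower().split()
    counts = Counter(t for t in tokens if t in FUNCTION_WORDS)
    total = sum(counts.values()) or 1  # avoid division by zero
    return {w: c / total for w, c in counts.items()}

profile = function_word_profile("Você viu o ônibus que te esperava?")
```

Feeding such profiles (instead of raw unigrams) into the term-frequency comparison keeps the classifier focused on grammatical usage, e.g. how often 'você' and clitic placement patterns appear, rather than on topic words.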
Apparently English (rather than Latin) loan words appear more often in Brazilian usage. Also, the appearance of 'recepção' predicts Brazilian, while 'receção' predicts European; compare 'ônibus' versus 'autocarro'. Usage of 'você' differs between the two as well. There must be many similar examples.
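A trivial rule-based baseline over such lexical pairs can already give a first signal. The marker list below is just the two pairs mentioned above; a real list would be much longer and substring matching would be replaced by proper tokenization:

```python
# (Brazilian form, European form) pairs from the examples above.
MARKERS = [("recepção", "receção"), ("ônibus", "autocarro")]

def marker_vote(doc):
    """Vote 'br', 'eu', or None based on which marker forms appear."""
    text = doc.lower()
    score = 0
    for br_form, eu_form in MARKERS:
        if br_form in text:
            score += 1
        if eu_form in text:
            score -= 1
    if score > 0:
        return "br"
    if score < 0:
        return "eu"
    return None  # no marker found; fall back to the statistical model
```

Such a voter is too sparse to classify every comment on its own, but it is a useful sanity check for the tf-idf model and a cheap way to pre-label documents for training.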