I am trying to implement the equations from the paper "Personality, Gender, and Age in the Language of Social Media" in an Android application. The input is a list of strings, and I use Pattern and Matcher to find the words that match each pattern.
I have 5 patterns and one list of 100 strings (900 words in total). Matching the 900 words against the patterns found 16, 25, 5, 50, and 10 words for each pattern respectively.
All of that is done. I am currently stuck on applying the equations from the article to the data I got, so that I can get values which can be converted to charts.
For each phrase or word, you have to calculate all three formulas.
The first equation gives you the pointwise mutual information of a phrase.
Let's say phrase = "best of luck"
Then pmi = log(probability("best of luck") / (probability("best") x probability("of") x probability("luck")))
So pmi is the log to the base 10 of the ratio between the probability of the whole phrase and the product of the probabilities of its individual words.
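The PMI formula above could be sketched in Java roughly as follows. This is only an illustration: the class and method names (PmiSketch, pmi) are made up for this answer, and it assumes you have already estimated the probability of the phrase and of each of its words from your own data.

```java
// Sketch of the first equation: pmi = log10( p(phrase) / product of p(word) ).
// Assumption: all probabilities were estimated beforehand and are greater than 0.
final class PmiSketch {

    static double pmi(double phraseProbability, double[] wordProbabilities) {
        // Multiply the probabilities of the individual words.
        double product = 1.0;
        for (double p : wordProbabilities) {
            product *= p;
        }
        // Base-10 log of the ratio, as described in the answer.
        return Math.log10(phraseProbability / product);
    }
}
```

For example, with hypothetical probabilities of 0.02 for the phrase and 0.1 for each of its three words, the ratio is 0.02 / 0.001 = 20, so the call PmiSketch.pmi(0.02, new double[] {0.1, 0.1, 0.1}) gives about 1.30.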
The second equation is the probability of the phrase occurring in the subject's text. You calculate it by dividing (the frequency of the phrase in the subject text) by (the sum of the frequencies of all phrases in the subject text).
For example, if the subject text is "Best of luck. You have long way to go. Best of luck."
Phrase = "Best of luck". The text contains two distinct phrases.
So p(phrase = "Best of luck") = frequency of "Best of luck" / (frequency of "Best of luck" + frequency of "You have long way to go")
= 2/(2+1)
= 2/3
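A minimal Java sketch of this second step, reproducing the 2/3 from the example. The names PhraseProbabilitySketch, countPhrases, and probability are hypothetical, and splitting the text on full stops is just an assumption to match the example; in your app the counts would come from your own Pattern and Matcher results.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the second equation: p(phrase) = freq(phrase) / sum of all phrase frequencies.
final class PhraseProbabilitySketch {

    // Assumption for illustration only: a "phrase" is whatever sits between full stops.
    static Map<String, Integer> countPhrases(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String part : text.split("\\.")) {
            String phrase = part.trim();
            if (phrase.isEmpty()) {
                continue;
            }
            Integer old = counts.get(phrase);
            counts.put(phrase, old == null ? 1 : old + 1);
        }
        return counts;
    }

    static double probability(String phrase, Map<String, Integer> counts) {
        // Denominator: sum of the frequencies of every phrase in the subject text.
        int total = 0;
        for (int c : counts.values()) {
            total += c;
        }
        Integer freq = counts.get(phrase);
        if (freq == null || total == 0) {
            return 0.0;
        }
        return freq / (double) total;
    }
}
```

With the example text, countPhrases finds "Best of luck" twice and "You have long way to go" once, so probability("Best of luck", counts) returns 2/3.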
The third equation gives you the Anscombe-transformed “relative frequency” of a word or phrase. You calculate it as 2 multiplied by the square root of (3/8 + the output of the second equation):
= 2 x square root of (3/8 + 2/3)
= 2 x square root of 1.0417
= 2 x 1.0206
= 2.04
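The same transform as a small Java sketch. The class and method names are invented for this answer; the input is assumed to be the relative frequency from the second equation.

```java
// Sketch of the third equation: Anscombe transform of a relative frequency p.
final class AnscombeSketch {

    static double anscombe(double p) {
        // 2 * sqrt(p + 3/8); note 3.0 / 8.0 to avoid integer division.
        return 2.0 * Math.sqrt(p + 3.0 / 8.0);
    }
}
```

Calling AnscombeSketch.anscombe(2.0 / 3.0) gives about 2.04, matching the worked example. The resulting values are what you would then feed into your charts.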