twitter nlp

Algorithm for keyword/phrase trend search similar to Twitter trends


I'm looking for ideas on building a tool that scans sentences of English text and builds a keyword ranking based on how often each word or phrase occurs in the texts.

This would be very similar to Twitter trends, wherein Twitter detects and reports the top 10 words within the tweets.

I have identified the high-level steps of the algorithm as follows:

  1. Scan the text and remove all the common, frequent words (such as "the", "is", "are", "what", "at", etc.).
  2. Add the remaining words to a hashmap. If a word is already in the map, increment its count.
  3. To get the top 10 words, iterate through the hashmap and find the 10 highest counts.
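
The three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: `STOP_WORDS` is a tiny hand-picked placeholder for step 1 (a real list would be far longer), and `tokenize` and `top_keywords` are hypothetical helper names, not part of any library.

```python
import re
from collections import Counter

# Assumption: a tiny hand-picked stop list standing in for step 1.
STOP_WORDS = {"the", "is", "are", "what", "at", "a", "an", "and",
              "of", "to", "this", "very"}


def tokenize(text):
    """Lowercase the text and split it into runs of letters/apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def top_keywords(texts, n=10):
    """Return the n most frequent non-stop words as (word, count) pairs."""
    counts = Counter()
    for text in texts:
        # Step 1: drop common words. Step 2: count what remains.
        counts.update(w for w in tokenize(text) if w not in STOP_WORDS)
    # Step 3: most_common returns the n entries with the highest counts.
    return counts.most_common(n)


if __name__ == "__main__":
    sample = [
        "This honey is very good",
        "The honey is at the market",
        "What good honey",
    ]
    print(top_keywords(sample, n=2))
```

`Counter` plays the role of the hashmap in step 2, and `most_common` replaces the manual scan in step 3.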

Steps 2 and 3 are straightforward, but for step 1 I do not know how to detect the important words in a text and separate them from the common words (prepositions, conjunctions, etc.).

Also, if I want to track phrases, what could the approach be? For example, given the text "This honey is very good", I might want to track "honey" and "good", but I may also want to track the phrases "very good" or "honey is very good".

Any suggestions would be greatly appreciated.

Thanks in advance


Solution

  • Actually, your step 1 is quite similar to step 3, in the sense that it also relies on a ranking of the most frequent words: you first need a fixed database of the most common words in the English language. Such lists are easy to find on the internet (Wikipedia even has an article listing the 100 most common words in English). Store those words in a hashmap and, while scanning your text, simply ignore any token found there.

    If you don't trust Wikipedia or the existing lists of common words, you can build your own database. For that purpose, scan thousands of tweets (the more the better) and make your own frequency chart; the words at the top of that chart are your common words.

    As for tracking phrases, you're facing an n-gram-like problem: an n-gram is a run of n consecutive words, so "very good" is a 2-gram.

    Do not reinvent the wheel. What you seem to want to do has been done thousands of times; just use existing libraries or pieces of code (check the External Links section of the n-gram Wikipedia page).
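
For the phrase-tracking part of the question, counting n-grams could look something like the sketch below. It assumes a simple regex tokenizer and counts every window of 1 to `max_n` consecutive words; `ngrams` and `count_phrases` are hypothetical helper names written for this example, and an existing library may well be the better choice in practice.

```python
import re
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a space-joined string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_phrases(texts, max_n=3):
    """Count all n-grams of length 1..max_n across the given texts."""
    counts = Counter()
    for text in texts:
        # Assumption: words are runs of letters/apostrophes, case-insensitive.
        tokens = re.findall(r"[a-z']+", text.lower())
        for n in range(1, max_n + 1):
            counts.update(ngrams(tokens, n))
    return counts


if __name__ == "__main__":
    counts = count_phrases(["This honey is very good", "very good honey"])
    print(counts["very good"])
    print(counts.most_common(5))
```

Note that this sketch does not remove common words. Filtering them out before building the n-grams would join words that were not adjacent in the original text, so one option is to apply the stop list only to single words.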