Tags: sql | algorithm | r | artificial-intelligence | pattern-recognition

Which algorithm can I use to find common adjacent words / pattern recognition?


I have a big table in my database with a lot of words from various texts, stored in text order. I want to find the number of times (the frequency) that some set of words appears together.

Example: Supposing these 4 words appear in many texts: United | States | of | America. I would get as a result:

United States: 50
United States of: 45
United States of America: 40

(This is only an example with 4 words; there can be sets with fewer or more than 4 words.)

Is there some algorithm that can do this, or something similar to this?

Edit: Some R or SQL code showing how to do this is welcome. I need a practical example of what I need to do.

Table Structure

I have two tables: Token, which has id and text. The text is UNIQUE, and each entry in this table represents a different word.

TextBlockHasToken is the table that keeps the text order. Each row represents a word in a text.

It has textblockid, the block of text the token belongs to; sentence, the sentence of the token; position, the token's position inside the sentence; and tokenid, a reference to the Token table.
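To make the setup concrete, here is a minimal sketch of counting adjacent word pairs directly against this schema. It assumes the table and column names above and that position increases by 1 between adjacent tokens within a sentence; exact dialect details may vary.

    -- Count how often each pair of adjacent words occurs.
    SELECT t1.text AS word1,
           t2.text AS word2,
           COUNT(*) AS frequency
    FROM TextBlockHasToken a
    JOIN TextBlockHasToken b
      ON  b.textblockid = a.textblockid
      AND b.sentence    = a.sentence
      AND b.position    = a.position + 1   -- the adjacency condition
    JOIN Token t1 ON t1.id = a.tokenid
    JOIN Token t2 ON t2.id = b.tokenid
    GROUP BY t1.text, t2.text
    ORDER BY frequency DESC;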


Solution

  • It is called an N-gram; in your case a 4-gram. It can indeed be obtained as the by-product of a Markov chain, but you could also use a sliding window (size 4) to walk through the (linear) text while updating a 4-dimensional "histogram" (a SQL sketch of this sliding window follows below).

    UPDATE 2011-11-22: A Markov chain is a way to model the probability of switching to a new state, given the current state. This is the stochastic equivalent of a "state machine". In the natural language case, the "state" is formed by the "previous N words", which implies that you consider the prior probability (before the previous N words) as equal to one. Computer people will most likely use a tree for implementing Markov chains in the NLP case. The "state" is simply the path taken from the root to the current node, and the probabilities of the words to follow are the probabilities of the current node's offspring. But every time we choose a new child node, we actually shift down the tree and "forget" the root node; our window is only N words wide, which translates to N levels deep into the tree.

    You can easily see that if you are walking a Markov chain/tree like this, at any time the probability before the first word is 1, the probability after the first word is P(w1), after the second word it is P(w2 | w1), etc. So, when processing the corpus you build a Markov tree (:= update the frequencies in the nodes); at the end of the run you can estimate the probability of a given choice of word by freq(word) / SUM(freq(siblings)). For a word 5 levels deep in the tree this is the probability of the word, given the previous 4 words. If you want the N-gram probabilities, you want the product of all the probabilities in the path from the root to the last word (a SQL sketch of this estimate also follows below).
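    Here is the sliding-window idea expressed in SQL against the question's schema: the four-way self-join is the window of size 4, and the GROUP BY plays the role of the "histogram". This is a sketch under the same assumptions as the pairwise query above (consecutive position values within a sentence):

        -- Count every window of 4 adjacent tokens (the 4-grams).
        SELECT t1.text AS w1, t2.text AS w2, t3.text AS w3, t4.text AS w4,
               COUNT(*) AS frequency
        FROM TextBlockHasToken a
        JOIN TextBlockHasToken b
          ON b.textblockid = a.textblockid AND b.sentence = a.sentence
         AND b.position = a.position + 1
        JOIN TextBlockHasToken c
          ON c.textblockid = a.textblockid AND c.sentence = a.sentence
         AND c.position = a.position + 2
        JOIN TextBlockHasToken d
          ON d.textblockid = a.textblockid AND d.sentence = a.sentence
         AND d.position = a.position + 3
        JOIN Token t1 ON t1.id = a.tokenid
        JOIN Token t2 ON t2.id = b.tokenid
        JOIN Token t3 ON t3.id = c.tokenid
        JOIN Token t4 ON t4.id = d.tokenid
        GROUP BY t1.text, t2.text, t3.text, t4.text
        ORDER BY frequency DESC;

    Dropping the d (or c and d) joins gives the trigram and bigram counts from the question's example.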
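    And a sketch of the freq(word) / SUM(freq(siblings)) estimate. It assumes you have materialized the window counts into two hypothetical tables, trigram(w1, w2, w3, freq) and fourgram(w1, w2, w3, w4, freq); in the tree picture, the trigram row is the parent node and the matching fourgram rows are its children:

        -- P(w4 | w1 w2 w3) = freq(4-gram) / freq(its 3-gram prefix).
        -- The 1.0 * forces floating-point division in most dialects.
        SELECT f.w1, f.w2, f.w3, f.w4,
               1.0 * f.freq / t.freq AS probability
        FROM fourgram f
        JOIN trigram  t
          ON t.w1 = f.w1 AND t.w2 = f.w2 AND t.w3 = f.w3;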