java machine-learning nlp linguistics google-natural-language

How to find frequently occurring phrases in a text document

I have a text document that contains multiple paragraphs. I need to find the phrases (groups of words appearing together) that occur frequently in it.

For example

Patient name xyz phone no 12345 emailid xyz@abc.com Patient name abc address some us address

Comparing these lines, the common phrase is "Patient name". The phrase can appear anywhere in a paragraph. My requirement is to find the most frequently occurring phrases in the document, irrespective of their position, using NLP.

Solution

You can use n-grams for this: count the number of times each sequence of n contiguous words appears. Because you don't know in advance how many words a repeated phrase will contain, try several values of n, e.g. from 2 to 6.

Java ngrams example tested on JDK 1.8.0:

import java.util.HashMap;
import java.util.Map;

public class NGramExample {

    // Counts how often each sequence of n consecutive words occurs in the text.
    public static HashMap<String, Integer> ngrams(String text, int n) {
        // Naive tokenization: split on single spaces, punctuation stays attached.
        String[] words = text.split(" ");

        HashMap<String, Integer> map = new HashMap<>();

        // Slide a window of n words over the text.
        for (int i = 0; i + n <= words.length; i++) {
            StringBuilder ngramWords = new StringBuilder(words[i]);
            for (int j = i + 1; j < i + n; j++) {
                ngramWords.append(' ').append(words[j]);
            }
            // Insert with count 1, or add 1 to the existing count.
            map.merge(ngramWords.toString(), 1, Integer::sum);
        }

        return map;
    }

    public static void main(String[] args) {
        System.out.println("Ngrams: ");
        HashMap<String, Integer> res = ngrams("Patient name xyz phone no 12345 emailid xyz@abc.com. Patient name abc address some us address", 2);
        for (Map.Entry<String, Integer> entry : res.entrySet()) {
            System.out.println(entry.getKey() + ":" + entry.getValue());
        }
    }
}

The output:

Ngrams: 
name abc:1
xyz@abc.com. Patient:1
emailid xyz@abc.com.:1
phone no:1
12345 emailid:1
Patient name:2
xyz phone:1
address some:1
us address:1
name xyz:1
some us:1
no 12345:1
abc address:1

As you can see, 'Patient name' has the highest count (2). You can call this function with several values of n and keep the phrases with the highest counts.
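Running the counter over several values of n could look like the sketch below. The class name RepeatedPhrases, the method repeatedPhrases and its minCount parameter are made up for this illustration. It uses the same naive split on single spaces as the example above and keeps only phrases that occur at least minCount times, ordered by descending count.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class RepeatedPhrases {

    // Counts every run of n consecutive space-separated words.
    static Map<String, Integer> ngrams(String text, int n) {
        String[] words = text.split(" ");
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= words.length; i++) {
            String phrase = String.join(" ", Arrays.copyOfRange(words, i, i + n));
            counts.merge(phrase, 1, Integer::sum);
        }
        return counts;
    }

    // Hypothetical helper: collects phrases of minN..maxN words that occur
    // at least minCount times, ordered by descending count.
    static Map<String, Integer> repeatedPhrases(String text, int minN, int maxN, int minCount) {
        List<Map.Entry<String, Integer>> kept = new ArrayList<>();
        for (int n = minN; n <= maxN; n++) {
            for (Map.Entry<String, Integer> e : ngrams(text, n).entrySet()) {
                if (e.getValue() >= minCount) {
                    kept.add(e);
                }
            }
        }
        // Stable sort: phrases with equal counts keep their collection order.
        kept.sort((a, b) -> b.getValue() - a.getValue());

        Map<String, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : kept) {
            result.put(e.getKey(), e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Patient name xyz phone no 12345 emailid xyz@abc.com. "
                + "Patient name abc address some us address";
        // Try n = 2..6 and keep phrases seen at least twice.
        System.out.println(repeatedPhrases(text, 2, 6, 2)); // prints {Patient name=2}
    }
}
```

For the sample text only the bigram 'Patient name' repeats, so the longer n values contribute nothing. On a real document you would probably also want to drop shorter phrases that are fully contained in a longer phrase with the same count.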

Edit: I will leave this Python code here for historical reasons.

A simple working Python example (using nltk) to show what I mean:

from nltk import ngrams
from collections import Counter

paragraph = 'Patient name xyz phone no 12345 emailid xyz@abc.com. Patient name abc address some us address'
n = 2
words = paragraph.split(' ') # of course you should split sentences in a better way
bigrams = ngrams(words, n)
c = Counter(bigrams)
print(c.most_common(1)[0]) # the most frequent bigram and its count

This gives you the output:

>> (('Patient', 'name'), 2)
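Both examples split on single spaces, which is why 'xyz@abc.com.' keeps its trailing period in the Java output. A slightly more careful tokenizer, as a minimal sketch: the class name SimpleTokenizer and the choice to lowercase and strip only leading and trailing punctuation are assumptions for illustration, not part of the answer above. It does not detect sentence boundaries, so n-grams can still span two sentences.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SimpleTokenizer {

    // Lowercases the text, splits on any whitespace run, and trims
    // punctuation from the start and end of each token.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            // Inner punctuation (as in an e-mail address) is left alone.
            String token = raw.replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("emailid xyz@abc.com. Patient name"));
    }
}
```

The resulting token list can be fed to the n-gram counter in place of text.split(" "), so that 'Patient name' and 'patient name,' are counted as the same phrase.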