I have a text document with multiple paragraphs. I need to find the phrases that occur together most frequently.
For example:
Patient name xyz phone no 12345 emailid [email protected]
Patient name abc address some us address
Comparing these lines, the common phrase is "Patient name". The phrase can appear anywhere in a paragraph. My requirement is to find the most frequently occurring phrases in the document, irrespective of their position, using NLP.
You should use n-grams for this: you simply count how many times each sequence of n contiguous words appears. Because you don't know in advance how many words will repeat, you can try several values of n, e.g. from 2 to 6.
A Java n-grams example, tested on JDK 1.8.0:
import java.util.*;

public class NGramExample {

    // Count every contiguous sequence of n words in the text.
    // Returns a map from n-gram (words joined by spaces) to its frequency.
    public static HashMap<String, Integer> ngrams(String text, int n) {
        ArrayList<String> words = new ArrayList<String>();
        for (String word : text.split(" ")) {
            words.add(word);
        }
        HashMap<String, Integer> map = new HashMap<String, Integer>();
        int c = words.size();
        for (int i = 0; i < c; i++) {
            // Only build an n-gram if there are at least n words left from position i.
            if ((i + n - 1) < c) {
                int stop = i + n;
                String ngramWords = words.get(i);
                for (int j = i + 1; j < stop; j++) {
                    ngramWords += " " + words.get(j);
                }
                // Increment the count for this n-gram (insert with 1 if unseen).
                map.merge(ngramWords, 1, Integer::sum);
            }
        }
        return map;
    }

    public static void main(String[] args) {
        System.out.println("Ngrams: ");
        HashMap<String, Integer> res = ngrams("Patient name xyz phone no 12345 emailid [email protected]. Patient name abc address some us address", 2);
        for (Map.Entry<String, Integer> entry : res.entrySet()) {
            System.out.println(entry.getKey() + ":" + entry.getValue().toString());
        }
    }
}
The output:
Ngrams:
name abc:1
[email protected]. Patient:1
emailid [email protected].:1
phone no:1
12345 emailid:1
Patient name:2
xyz phone:1
address some:1
us address:1
name xyz:1
some us:1
no 12345:1
abc address:1
So you see how 'Patient name' has the maximum count: 2 occurrences. You could run this function with several n values and retrieve the maximum occurrences for each.
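For instance, here is a minimal sketch of that idea, reusing the ngrams method above. The wrapper class name MostFrequentPhrase and the 2-to-6 range are just illustrative assumptions:

import java.util.Map;

public class MostFrequentPhrase {
    public static void main(String[] args) {
        String text = "Patient name xyz phone no 12345 emailid [email protected]. "
                    + "Patient name abc address some us address";
        String bestPhrase = null;
        int bestCount = 0;
        // Try every phrase length from 2 to 6 words and remember the
        // single n-gram with the highest count across all lengths.
        for (int n = 2; n <= 6; n++) {
            for (Map.Entry<String, Integer> entry : NGramExample.ngrams(text, n).entrySet()) {
                if (entry.getValue() > bestCount) {
                    bestCount = entry.getValue();
                    bestPhrase = entry.getKey();
                }
            }
        }
        System.out.println(bestPhrase + ": " + bestCount); // "Patient name: 2" for this text
    }
}

Ties are resolved by whichever n-gram happens to be encountered first; for real documents you may also want to lowercase the text and strip punctuation before counting.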
Edit: I will leave this Python code here for historical reasons.
A simple working Python example (using nltk) to show you what I mean:
from nltk import ngrams
from collections import Counter
paragraph = 'Patient name xyz phone no 12345 emailid [email protected]. Patient name abc address some us address'
n = 2
words = paragraph.split(' ') # of course you should split sentences in a better way
bigrams = ngrams(words, n)
c = Counter(bigrams)
c.most_common()[0]
This gives you the output:
>> (('Patient', 'name'), 2)