I'm trying to extract ngrams that are common to several sentences.
What I have:
My data consists of separate sentences:
data = [
'Design and architecture project',
'Web inquiry for products',
'Software for non-profit project',
'Web inquiry for vendors'
]
What I do:
from nltk import everygrams
from collections import Counter
full_corpus = ' '.join(data)
ngram_counts = Counter(everygrams(full_corpus.split(), 2, 5))
repeated_ngrams = [(' '.join(ngram), count) for ngram, count in list(ngram_counts.items()) if count > 1]
print(repeated_ngrams)
What I get:
[('project Web', 2), ('project Web inquiry', 2), ('project Web inquiry for', 2), ('Web inquiry', 2), ('Web inquiry for', 2), ('inquiry for', 2)]
Problem with what I get now:
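With this approach, some of the ngrams consist of words from different sentences, because joining the sentences into one corpus places the last word of one sentence right before the first word of the next. The entries ('project Web', 2), ('project Web inquiry', 2), and ('project Web inquiry for', 2) are therefore incorrect.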
What I want to get:
[('Web inquiry', 2), ('Web inquiry for', 2), ('inquiry for', 2)]
Can anybody please help me with this? It seems like a common task, but I was unable to find any information about a proper solution.
Goal behind the task:
I'm going to use the extracted ngrams to suggest autocompletions to users. For example, it would make sense to suggest the word 'for' when the user types 'inquiry'.
However, if the word 'web' is suggested after the user types 'project', as the currently extracted ngrams would imply, that will be wrong.
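To make the goal concrete, here is a minimal sketch of the kind of lookup I have in mind. The helper names `build_next_word_table` and `suggest` are hypothetical, and lowercasing the tokens is an assumption on my side, not a requirement. Word pairs are only counted within a single sentence, so nothing is suggested after the last word of a sentence.

```python
from collections import Counter, defaultdict

data = [
    'Design and architecture project',
    'Web inquiry for products',
    'Software for non-profit project',
    'Web inquiry for vendors',
]

def build_next_word_table(sentences):
    """Hypothetical helper: map each word to a Counter of the words that follow it.

    Pairs are taken per sentence, so they never cross a sentence boundary.
    """
    table = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()  # assumption: case-insensitive matching
        for current, following in zip(tokens, tokens[1:]):
            table[current][following] += 1
    return table

def suggest(table, word, k=3):
    """Hypothetical helper: return up to k most frequent next words for `word`."""
    followers = table.get(word.lower(), Counter())
    return [w for w, _ in followers.most_common(k)]

table = build_next_word_table(data)
print(suggest(table, 'inquiry'))  # ['for']
print(suggest(table, 'project'))  # []
```

Here 'project' only ever appears at the end of a sentence, so it gets no suggestions, which is the behaviour I want.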
Update: I guess I have a solution, but I'm not sure if it is optimal:
end_token = "<end>"
marked_data = [f"{entry} {end_token}" for entry in data]
full_corpus = " ".join(marked_data)
ngram_counts = Counter(everygrams(full_corpus.split(), 2, 5))
repeated_ngrams = [
(" ".join(ngram), count)
for ngram, count in list(ngram_counts.items())
if count > 1 and end_token not in ngram
]
print(repeated_ngrams)
One possible solution is to run everygrams on each sentence separately:
...
import itertools
...
ngram_counts = Counter(itertools.chain.from_iterable(
[everygrams(el.split(), 2, 5) for el in data]))
...
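For reference, the same per-sentence idea can be sketched without any third-party dependency. The generator `sentence_ngrams` below is a hypothetical stand-in for `everygrams`, not NLTK's implementation; it only assumes whitespace tokenization, as in the question. The result is sorted so the output order is deterministic.

```python
from collections import Counter
from itertools import chain

data = [
    'Design and architecture project',
    'Web inquiry for products',
    'Software for non-profit project',
    'Web inquiry for vendors',
]

def sentence_ngrams(tokens, min_len, max_len):
    """Hypothetical stand-in for everygrams: yield every n-gram of `tokens`
    whose length n satisfies min_len <= n <= max_len."""
    for n in range(min_len, max_len + 1):
        for start in range(len(tokens) - n + 1):
            yield tuple(tokens[start:start + n])

# Generate n-grams sentence by sentence, so none can span two sentences.
ngram_counts = Counter(chain.from_iterable(
    sentence_ngrams(sentence.split(), 2, 5) for sentence in data))

repeated_ngrams = sorted(
    (' '.join(ngram), count)
    for ngram, count in ngram_counts.items()
    if count > 1
)
print(repeated_ngrams)
# [('Web inquiry', 2), ('Web inquiry for', 2), ('inquiry for', 2)]
```

Compared with the end-token approach from the update, this never creates the cross-sentence ngrams in the first place, so there is nothing to filter out afterwards.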