Tags: python, nlp, nltk, n-gram

Extract n-grams that are common to several sentences


I'm trying to extract n-grams that are common to several sentences.

What I have:
My data consists of separate sentences:

data = [
    'Design and architecture project',
    'Web inquiry for products',
    'Software for non-profit project',
    'Web inquiry for vendors'
]

What I do:

from nltk import everygrams
from collections import Counter

full_corpus = ' '.join(data)
# Count all 2- to 5-grams over the whole joined corpus
ngram_counts = Counter(everygrams(full_corpus.split(), 2, 5))
repeated_ngrams = [(' '.join(ngram), count) for ngram, count in ngram_counts.items() if count > 1]
print(repeated_ngrams)

What I get:
[('project Web', 2), ('project Web inquiry', 2), ('project Web inquiry for', 2), ('Web inquiry', 2), ('Web inquiry for', 2), ('inquiry for', 2)]

Problem with what I get now:
With this approach, some of the n-grams mix words from different sentences. The entries ('project Web', 2), ('project Web inquiry', 2), and ('project Web inquiry for', 2) are incorrect, since 'project' and 'Web' never occur together within a single sentence.

What I want to get:
[('Web inquiry', 2), ('Web inquiry for', 2), ('inquiry for', 2)]

Can anybody help me with this? It seems like a common task, but I couldn't find any information about a proper solution.

Goal behind the task:
I'm going to use the extracted n-grams to suggest autocompletions to users. For example, it would make sense to suggest the word 'for' once the user types 'inquiry'. However, if the currently extracted n-grams caused 'Web' to be suggested after the user types 'project', that would be wrong.
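
For illustration, here is a minimal sketch of how the extracted n-grams could drive such suggestions. The completions index is hypothetical (not part of the question) and assumes repeated_ngrams holds the desired output shown above:

from collections import defaultdict

# Hypothetical autocomplete index: map every proper prefix of each
# extracted n-gram to the set of words that can follow it
completions = defaultdict(set)
for phrase, count in repeated_ngrams:
    words = phrase.split()
    for i in range(1, len(words)):
        completions[' '.join(words[:i])].add(words[i])

print(completions.get('inquiry'))  # {'for'}
print(completions.get('project'))  # None -- no suggestion, as desired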


Update: I guess I have a solution, but I'm not sure if it is optimal:

end_token = "<end>"
# Mark the end of each sentence with a sentinel token before joining
marked_data = [f"{entry} {end_token}" for entry in data]

full_corpus = " ".join(marked_data)
ngram_counts = Counter(everygrams(full_corpus.split(), 2, 5))
# Keep repeated n-grams, discarding any that contain the sentinel
repeated_ngrams = [
    (" ".join(ngram), count)
    for ngram, count in ngram_counts.items()
    if count > 1 and end_token not in ngram
]
print(repeated_ngrams)
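
This works because any n-gram that crosses a sentence boundary must contain the <end> sentinel sitting between the two sentences, so the end_token not in ngram check filters out exactly the cross-sentence entries.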

Solution

  • One possible solution is to run everygrams on each sentence separately and merge the counts:

    from collections import Counter
    import itertools

    from nltk import everygrams

    # Count n-grams within each sentence, then merge the per-sentence counts
    ngram_counts = Counter(itertools.chain.from_iterable(
        everygrams(sentence.split(), 2, 5) for sentence in data))
    repeated_ngrams = [(' '.join(ngram), count)
                       for ngram, count in ngram_counts.items()
                       if count > 1]
    print(repeated_ngrams)
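
    With the sample data this yields exactly the desired result, [('Web inquiry', 2), ('Web inquiry for', 2), ('inquiry for', 2)], though the order of entries may vary with the order in which the n-grams are first seen.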