Tags: python, pandas, nlp, nltk

How can I look for specific bigrams in a text example in Python?


I am interested in finding how often (as a percentage) a set of words, given as n_grams, appears in a sentence.

import re

example_txt = ["order intake is strong for Q4"]

def find_ngrams(text):
    # Tokenise into runs of ASCII letters ('[A-z]' would also match characters such as '_' and '^')
    text = re.findall('[A-Za-z]+', text)
    content = [w for w in text if w.lower() in n_grams] # you can calculate %stopwords using "in"
    return round(float(len(content)) / float(len(text)), 5)

# The goal is for the above procedure to work on a pandas DataFrame, but for now let's use 'text' as an example.
#full_MD['n_grams'] = [find_ngrams(x) for x in list(full_MD.loc[:,'text_no_stopwords'])]

Below are two examples. The first one works; the second does not.

n_grams= ['order']
res = [find_ngrams(x) for x in list(example_txt)]
print(res)
Output:
[0.16667]

n_grams= ['order intake']
res = [find_ngrams(x) for x in list(example_txt)]
print(res)
Output:
[0.0]

How can I make the find_ngrams() function process bigrams, so the last example from above works?

Edit: Any other ideas?


Solution

  • You can use spaCy's Matcher:

    import spacy
    from spacy.matcher import Matcher
    
    nlp = spacy.load("en_core_web_sm")
    matcher = Matcher(nlp.vocab)
    # Add match ID "orderintake" with one pattern and no callback
    # (spaCy v3 signature; v2 used matcher.add("orderintake", None, pattern))
    pattern = [{"LOWER": "order"}, {"LOWER": "intake"}]
    matcher.add("orderintake", [pattern])
    
    doc = nlp("order intake is strong for Q4")
    matches = matcher(doc)
    print(len(matches))  # Number of times the bigram appears in the text
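
If you would rather stay with plain Python and get a share instead of a count, a rough sketch along the lines of the question's function could compare sliding windows of tokens against each target n-gram. The name `find_ngrams_multi`, passing `n_grams` as a parameter, and dividing the number of matches by the total token count are all assumptions made here, chosen so that the single-word case gives the same figure as in the question.

```python
import re

def find_ngrams_multi(text, n_grams):
    """Share of tokens at which one of the target n-grams starts (hypothetical helper)."""
    # Same tokenisation idea as the question: runs of ASCII letters, lowercased
    tokens = [t.lower() for t in re.findall(r"[A-Za-z]+", text)]
    if not tokens:
        return 0.0
    hits = 0
    for gram in n_grams:
        target = gram.lower().split()
        n = len(target)
        if n == 0:
            continue
        # Slide a window of n tokens over the text and count exact matches
        hits += sum(
            tokens[i:i + n] == target
            for i in range(len(tokens) - n + 1)
        )
    # Assumption: the denominator is the token count, as in the question
    return round(hits / len(tokens), 5)

example_txt = ["order intake is strong for Q4"]
res_unigram = [find_ngrams_multi(x, ["order"]) for x in example_txt]
res_bigram = [find_ngrams_multi(x, ["order intake"]) for x in example_txt]
print(res_unigram)  # [0.16667]
print(res_bigram)   # [0.16667]
```

The original function fails on 'order intake' because it tests each single token against the list, and no single token ever equals a two-word string. Comparing windows of the same length as the target avoids that. If you would prefer the share of tokens covered by a match, multiply each hit by the n-gram's length before dividing.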