python, nlp, spacy

How to extract sentences with key phrases in spaCy


I have worked with spaCy and so far have found it very intuitive and robust for NLP. I am trying to build a sentence search over text that works both ways: word-based search as well as content-based search. So far, I have not found any solution with spaCy.

I have the text like:

In computer science, artificial intelligence (AI), sometimes called machine intelligence, is intelligence demonstrated by machines, unlike the natural intelligence displayed by humans and animals. Leading AI textbooks define the field as the study of "intelligent agents": any device that perceives its environment and takes actions that maximize its chance of successfully achieving its goals.[1] Colloquially, the term "artificial intelligence" is often used to describe machines (or computers) that mimic "cognitive" functions that humans associate with the human mind, such as "learning" and "problem solving".[2]

As machines become increasingly capable, tasks considered to require "intelligence" are often removed from the definition of AI, a phenomenon known as the AI effect.[3] A quip in Tesler's Theorem says "AI is whatever hasn't been done yet."[4] For instance, optical character recognition is frequently excluded from things considered to be AI,[5] having become a routine technology.[6] Modern machine capabilities generally classified as AI include successfully understanding human speech,[7] competing at the highest level in strategic game systems (such as chess and Go),[8] autonomously operating cars, intelligent routing in content delivery networks, and military simulations[9].

Artificial intelligence was founded as an academic discipline in 1955, and in the years since has experienced several waves of optimism,[10][11] followed by disappointment and the loss of funding (known as an "AI winter"),[12][13] followed by new approaches, success and renewed funding.[11][14] For most of its history, AI research has been divided into sub-fields that often fail to communicate with each other.[15] These sub-fields are based on technical considerations, such as particular goals (e.g. "robotics" or "machine learning"),[16] the use of particular tools ("logic" or artificial neural networks), or deep philosophical differences.[17][18][19] Sub-fields have also been based on social factors (particular institutions or the work of particular researchers).[15]

Now I want to extract complete sentences by matching multiple words or strings. E.g., I want to search for intelligent and machine learning, and have it print all complete sentences that contain either one or both of the given strings.

Is there any way that spaCy, with one of its models loaded, can detect the phrase matches, i.e. find all sentences containing intelligent and machine learning and print them? And, as a second option, when I search for machine learning, can it also suggest related terms such as deep learning, artificial intelligence, pattern recognition, etc.?

import spacy
nlp = spacy.load("en_core_web_sm")
from spacy.matcher import PhraseMatcher
phrase_matcher = PhraseMatcher(nlp.vocab)

phrases = ['machine learning', 'intelligent', 'human']

patterns = [nlp(text) for text in phrases]

phrase_matcher.add('AI', None, *patterns)

sentence = nlp(processed_article)

matched_phrases = phrase_matcher(sentence)

for match_id, start, end in matched_phrases:
    string_id = nlp.vocab.strings[match_id]  
    span = sentence[start:end]                   
    print(match_id, string_id, start, end, span.text)

I tried this, but it does not return the complete sentence, only the matched word along with its match ID.

In short:

  1. I am trying to search with multiple input words and find the complete sentences that contain either any one of the input strings or all of them.
  2. I am trying to use the trained model to also find suggested (related) sentences for the input.

Solution

  • Part 1:

    I want to search for intelligent and machine learning, and have it print all complete sentences that contain either one or both of the given strings.

    This is how you can find the complete sentences that contain the keywords you are looking for. Keep in mind that sentence boundaries are determined statistically, so this works fine if the incoming paragraphs come from news or Wikipedia, but it will not work as well if the data comes from social media.

    import spacy
    from spacy.matcher import PhraseMatcher
    
    text = """I like tomtom and I cannot lie. In computer science, artificial intelligence (AI), sometimes called machine intelligence, is intelligence demonstrated by machines, unlike the natural intelligence displayed by humans and animals.  Leading AI textbooks define the field as the study of "intelligent agents": any device that perceives its  environment and takes actions that maximize its chance of successfully achieving its goals.[1] Colloquially,  the term "artificial intelligence" is often used to describe machines (or computers) that mimic "cognitive"  functions that humans associate with the human mind, such as "learning" and "problem solving".[2] """
    
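    nlp = spacy.load("en_core_web_sm")
    
    phrase_matcher = PhraseMatcher(nlp.vocab)
    phrases = ['machine learning', 'artificial intelligence']
    # make_doc only tokenizes, which is all the matcher needs for its patterns
    patterns = [nlp.make_doc(phrase) for phrase in phrases]
    # spaCy v3 signature; on spaCy v2 use phrase_matcher.add('AI', None, *patterns)
    phrase_matcher.add('AI', patterns)
    
    doc = nlp(text)
    
    for sent in doc.sents:
        for match_id, start, end in phrase_matcher(nlp(sent.text)):
            if nlp.vocab.strings[match_id] in ["AI"]:
                print(sent.text)
                break  # print each sentence once, even if it has several matches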
    

    Output

    In computer science, artificial intelligence (AI), sometimes called machine intelligence, is intelligence demonstrated by machines, unlike the natural intelligence displayed by humans and animals.  
    Colloquially,  the term "artificial intelligence" is often used to describe machines (or computers)
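
The question also asks for sentences that contain either one or all of the search strings. A minimal sketch of that, assuming spaCy v3: the function name `find_sentences` and the `require_all` flag are hypothetical names for illustration, and a blank English pipeline with the rule-based `sentencizer` is swapped in for `en_core_web_sm` so that no model download is needed. Each phrase is registered under its own key, the matcher runs once over the whole document, and every match is mapped back to its sentence, so a sentence is reported only once.

```python
import spacy
from spacy.matcher import PhraseMatcher


def find_sentences(text, phrases, require_all=False):
    """Return (sentence, matched phrases) pairs, one entry per matching sentence."""
    # Assumption: blank pipeline + rule-based sentencizer instead of a trained model.
    nlp = spacy.blank("en")
    nlp.add_pipe("sentencizer")

    # attr="LOWER" makes the phrase matching case-insensitive.
    matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
    for phrase in phrases:
        # Use the phrase itself as the match key so it can be recovered later.
        matcher.add(phrase, [nlp.make_doc(phrase)])

    doc = nlp(text)
    hits = {}  # sentence start token index -> (sentence span, matched phrases)
    for match_id, start, end in matcher(doc):
        sent = doc[start:end].sent
        entry = hits.setdefault(sent.start, (sent, set()))
        entry[1].add(nlp.vocab.strings[match_id])

    wanted = set(phrases)
    results = []
    for sent_start in sorted(hits):  # keep document order
        sent, found = hits[sent_start]
        if not require_all or found == wanted:
            results.append((sent.text, sorted(found)))
    return results
```

Calling `find_sentences(text, ['machine learning', 'intelligent'])` returns every sentence with at least one of the phrases, while `require_all=True` keeps only the sentences that contain both. With a trained pipeline such as `en_core_web_sm`, the two lines that build `nlp` can be replaced by `spacy.load(...)`.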
    

  • Part 2:

    When I search for machine learning, can it also suggest deep learning, artificial intelligence, pattern recognition, etc.?

    Yes, that is very much possible. You would need to use word vectors, such as word2vec or sense2vec, to do that.
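
The idea behind such suggestions can be sketched without any trained model: every term gets a vector, and the suggestions are the terms whose vectors have the highest cosine similarity to the query. The sketch below is only an illustration of that lookup. The three-dimensional vectors in `TOY_VECTORS` are invented, and `cosine` and `suggest` are hypothetical helper names; in practice the vectors would come from a trained word2vec or sense2vec model.

```python
import math

# Hypothetical toy vectors, invented for illustration only.
# Real vectors would come from a trained word2vec or sense2vec model.
TOY_VECTORS = {
    "machine learning": [0.9, 0.8, 0.1],
    "deep learning": [0.85, 0.9, 0.15],
    "artificial intelligence": [0.8, 0.7, 0.2],
    "pattern recognition": [0.7, 0.6, 0.3],
    "banana": [0.1, 0.0, 0.9],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def suggest(term, vectors, top_n=3):
    """Return the top_n terms most similar to `term`, excluding the term itself."""
    query = vectors[term]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != term
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in scored[:top_n]]
```

With these made-up vectors, `suggest("machine learning", TOY_VECTORS)` ranks the three AI-related terms above the unrelated one. The suggested terms could then be fed into the `PhraseMatcher` from Part 1 to pull out the related sentences.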