
Extracting the words that follow a phrase


I have one text file with a list of phrases. Below is how the file looks:

Filename: KP.txt

[Image: contents of KP.txt]

From the input paragraph below, I want to extract the next 2 words after a phrase from KP.txt (the phrase could be any of those listed in the file). All I need is those 2 words.

Input:

This is Lee. Thanks for contacting me. I wanted to know the exchange policy at Noriaqer hardware services.

In the above example, the phrase "I wanted to know" matches an entry in KP.txt. So if I extract the next 2 words after it, my output should be "exchange policy".

How can I extract this in Python?


Solution

  • Assuming you already know how to read the input file into a list, it can be done with some help from regex.

    >>> wordlist = ['I would like to understand', 'I wanted to know', 'I wish to know', 'I am interested to know']
    >>> input_text = 'This is Lee. Thanks for contacting me. I wanted to know exchange policy at Noriaqer hardware services.'
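    >>> import re
    >>> def word_extraction(input_text, wordlist):
    ...     for word in wordlist:
    ...         if word in input_text:
    ...             # re.escape keeps any regex metacharacters in the phrase literal
    ...             output = re.search(r'(?<=%s)(.\w*){2}' % re.escape(word), input_text)
    ...             if output:
    ...                 print(output.group().lstrip())
    ...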
    >>> word_extraction(input_text, wordlist)
    exchange policy
    >>> input_text = 'This is Lee. Thanks for contacting me. I wish to know where is Noriaqer hardware.'
    >>> word_extraction(input_text, wordlist)
    where is
    >>> input_text = 'This is Lee. Thanks for contacting me. I\'d like to know where is Noriaqer hardware.'
    >>> word_extraction(input_text, wordlist)
    
    >>>
    
    1. First, check whether the phrase occurs in the input text. This is not the most efficient approach for a large list of phrases, but it works for now.
    2. Next, if the phrase from our list is present, use a regex to extract the words that follow it.
    3. Finally, strip the leading whitespace from the extracted words.
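
The three steps above can also be sketched as a function that returns the matches instead of printing them. This is only an illustration: `load_phrases` and `next_two_words` are hypothetical names, and the one-phrase-per-line layout of KP.txt is an assumption, since the file contents are not shown here.

```python
import re
from pathlib import Path


def load_phrases(path):
    """Read phrases from a text file, assuming one phrase per line."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def next_two_words(text, phrases):
    """Return {phrase: following words} for every phrase found in text."""
    found = {}
    for phrase in phrases:
        # Steps 1 and 2: the lookbehind only matches when the phrase is present.
        match = re.search(r"(?<=%s)(.\w*){2}" % re.escape(phrase), text)
        if match:
            # Step 3: drop the leading space consumed by the first '.'.
            found[phrase] = match.group().lstrip()
    return found


# In practice the list would come from load_phrases("KP.txt").
phrases = ["I would like to understand", "I wanted to know", "I wish to know"]
text = ("This is Lee. Thanks for contacting me. "
        "I wanted to know exchange policy at Noriaqer hardware services.")
print(next_two_words(text, phrases))  # {'I wanted to know': 'exchange policy'}
```

Returning a dictionary keeps the function usable when several phrases match the same text, and it avoids an error when a phrase sits at the very end of the input with nothing after it.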

    Regex Hint:

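    • (?<=%s) is a lookbehind assertion. It matches only at a position immediately preceded by the phrase substituted for %s, for example "I wanted to know", without including the phrase in the match.
    • (.\w*){2} matches any single character (here the space) followed by zero or more word characters, repeated exactly twice. In plain text this captures the 2 words after the key phrase.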
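
To see what the pattern really captures, the short sketch below compares it with a stricter alternative. The `strict` pattern and both sample sentences are assumptions made for illustration, not part of the original answer. Because `.` accepts any character and `\w*` may match nothing, punctuation right after the phrase uses up one of the two repetitions.

```python
import re

phrase = "I wanted to know"

# Pattern from the answer: lookbehind plus two "any char + word chars" groups.
loose = re.compile(r"(?<=%s)(.\w*){2}" % re.escape(phrase))
# Hypothetical stricter variant: skip non-word characters, then capture two words.
strict = re.compile(re.escape(phrase) + r"\W+(\w+)\W+(\w+)")

plain = "I wanted to know exchange policy at the store."
comma = "I wanted to know, exactly, the exchange policy."

print(repr(loose.search(plain).group()))        # ' exchange policy'
print(repr(loose.search(comma).group()))        # ', exactly'
print(" ".join(strict.search(comma).groups()))  # exactly the
```

If the input may contain punctuation directly after the phrase, the stricter variant is probably closer to "the next 2 words".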