Tags: python, python-2.7, numbers, extract, words

I want to extract a certain number of words surrounding a given word in a long string (paragraph) in Python 2.7


I am trying to extract a fixed number of words surrounding a given word. Here is an example to make it clear:

string = "Education shall be directed to the full development of the human personality and to the strengthening of respect for human rights and fundamental freedoms."

1) If the selected word is development, I need the 6 words surrounding it (3 before and 3 after): [to, the, full, of, the, human]


2) If the selected word is at the beginning or in second position, I still need to get 6 words. For example:

If the selected word is shall, I should get: [Education, be, directed, to, the, full]

I should use the 're' module. This is what I have so far:

import re

def search(text, n):
    '''Search text for the word 'place' and return the n words on either side of it, as two separate tuples.'''
    word = r"\W*([\w]+)"
    groups = re.search(r'{}\W*{}{}'.format(word*n, 'place', word*n), text).groups()
    return groups[:n], groups[n:]

but it only handles the first case. Can someone help me out with this? I would be really grateful. Thank you in advance!


Solution

  • This will extract every occurrence of the target word in your text, together with its context. Each result is a window of context + 1 words that includes the target word itself:

    import re
    
    text = ("Education shall be directed to the full development of the human personality "
            "and to the strengthening of respect for human rights and fundamental freedoms.")
    
    def search(target, text, context=6):
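        # It's easier to use re.findall than to split the string,
        # as it also gets rid of the punctuation
        words = re.findall(r'\w+', text)
    
        # Positions of the target word, compared case-insensitively
        matches = (i for (i,w) in enumerate(words) if w.lower() == target.lower())
        for index in matches:
            if index < context //2:
                # Too close to the start: take the first context+1 words
                yield words[0:context+1]
            elif index > len(words) - context//2 - 1:
                # Too close to the end: take the last context+1 words
                yield words[-(context+1):]
            else:
                # Otherwise centre the window on the target word
                yield words[index - context//2:index + context//2 + 1]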
    
    print(list(search('the', text)))
    # [['be', 'directed', 'to', 'the', 'full', 'development', 'of'], 
    #  ['full', 'development', 'of', 'the', 'human', 'personality', 'and'], 
    #  ['personality', 'and', 'to', 'the', 'strengthening', 'of', 'respect']]
    
    print(list(search('shall', text)))
    # [['Education', 'shall', 'be', 'directed', 'to', 'the', 'full']]
    
    print(list(search('freedoms', text)))
    # [['respect', 'for', 'human', 'rights', 'and', 'fundamental', 'freedoms']]
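
The results above keep the target word inside each window, whereas the question lists only the surrounding words. If you want exactly that, here is a minimal sketch of a variant that drops the target by position. The function name surrounding_words and the parameter n are my own, not from the question, and I am assuming that a window which would run past either end of the text should simply be shifted back inside it:

```python
import re

text = ("Education shall be directed to the full development of the human personality "
        "and to the strengthening of respect for human rights and fundamental freedoms.")

def surrounding_words(target, text, n=6):
    """Yield, for each occurrence of target, the n nearest words without the target."""
    words = re.findall(r'\w+', text)
    half = n // 2
    target = target.lower()
    for i, w in enumerate(words):
        if w.lower() != target:
            continue
        # First index of a window of n + 1 words (target included),
        # shifted so that it never runs past either end of the text.
        start = max(0, min(i - half, len(words) - (n + 1)))
        window = words[start:start + n + 1]
        # Remove the target by its position inside the window, so that
        # other occurrences of the same word in the context are kept.
        pos = i - start
        yield window[:pos] + window[pos + 1:]

print(list(surrounding_words('development', text)))
# [['to', 'the', 'full', 'of', 'the', 'human']]

print(list(surrounding_words('shall', text)))
# [['Education', 'be', 'directed', 'to', 'the', 'full']]
```

Removing the target by position rather than by value matters for a word such as 'the', which can also appear in its own context. If the text has fewer than n + 1 words, this sketch just returns every other word it finds.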