Determine proximity between 2 strings in Python

I have two strings, "loss of gene" and "aquaporin protein". I want to check whether both occur in a line of my file within 5 words of each other. Any ideas? I have searched extensively but cannot find anything. Also, since these are multi-word strings, I cannot use abs(array.index) for the two (which was possible with single words).

Thanks

Solution

You could try the following approach:

First, sanitise your text by converting it to lowercase, keeping only word characters (letters, digits and underscores) and enforcing a single space between words.
Next, search for each phrase in the resulting text and note its starting index and length. Sort this list so the phrases are in order of appearance.
Then check that every phrase was found by making sure none of the indexes is -1.
If all are found, count the number of words between the end of the first phrase and the start of the last phrase. To do this, take the text slice from the end of the first phrase to the start of the last phrase and split it into words.
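As a quick illustration of what each step produces, here is a small trace on a sample sentence. The variable names (clean, found, gap_words) are only illustrative.

```python
import re

text = "The  Aquaporin protein, sometimes  'may' exhibit a big LOSS of gene."

# Step 1: lowercase, keep word characters only, one space between words
clean = ' '.join(re.findall(r'\w+', text.lower()))
print(clean)      # the aquaporin protein sometimes may exhibit a big loss of gene

# Step 2: (start index, length) per phrase, sorted by position in the text
found = sorted((clean.find(p), len(p)) for p in ['loss of gene', 'aquaporin protein'])
print(found)      # [(4, 17), (50, 12)]

# Step 3: str.find returns -1 for a phrase that is missing
all_present = all(start != -1 for start, _ in found)

# Step 4: words between the end of the first phrase and the start of the last
(first_start, first_len), (last_start, _) = found
gap_words = clean[first_start + first_len:last_start].split()
print(gap_words)  # ['sometimes', 'may', 'exhibit', 'a', 'big']
```

Here five words separate the two phrases, so the pair is just within the limit of 5.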

Script as follows:

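import re

text = "The  Aquaporin protein, sometimes  'may' exhibit a big LOSS of gene."
# Sanitise: lowercase, keep only words, single space between them
text = ' '.join(re.findall(r'\b(\w+)\b', text.lower()))

# (start index, length) of each phrase, sorted by position in the text
indexes = sorted((text.find(x), len(x)) for x in ['loss of gene', 'aquaporin protein'])

# Words between the end of the first phrase and the start of the last
gap = text[indexes[0][0] + indexes[0][1] : indexes[-1][0]].split()

if all(i[0] != -1 for i in indexes) and len(gap) <= 5:
    print("matched")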
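One caveat: str.find works on characters, so it can match inside a longer word (for example 'loss of gene' inside 'loss of genes'), and it only looks at the first occurrence of each phrase. If that matters for your data, a possible alternative is to compare word lists instead. The following is only a sketch of that different technique, not the script above; tokenize, phrase_positions and within_proximity are hypothetical helper names, and max_gap is an assumed parameter.

```python
import re

def tokenize(text):
    """Lowercase the text and return its words as a list."""
    return re.findall(r'\w+', text.lower())

def phrase_positions(words, phrase_words):
    """Return every word index at which phrase_words starts in words."""
    n = len(phrase_words)
    return [i for i in range(len(words) - n + 1) if words[i:i + n] == phrase_words]

def within_proximity(text, phrase_a, phrase_b, max_gap=5):
    """True if some occurrence of each phrase is at most max_gap words apart."""
    words = tokenize(text)
    a, b = tokenize(phrase_a), tokenize(phrase_b)
    for i in phrase_positions(words, a):
        for j in phrase_positions(words, b):
            # Number of words between the end of the earlier phrase
            # and the start of the later one
            gap = j - (i + len(a)) if i < j else i - (j + len(b))
            # A negative gap means the occurrences overlap; skip those
            if 0 <= gap <= max_gap:
                return True
    return False

text = "The  Aquaporin protein, sometimes  'may' exhibit a big LOSS of gene."
print(within_proximity(text, 'loss of gene', 'aquaporin protein'))
```

Unlike the script above, this sketch ignores overlapping occurrences and checks every pair of occurrences, so it may be slower on very long lines.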

To extend this to work on a file with a list of phrases, the following approach could be used:

import re

log = 'loss of gene'
phrases = ['aquaporin protein', 'another protein']

with open('input.txt') as f_input:
    for number, line in enumerate(f_input, start=1):
        # Sanitise the line
        text = ' '.join(re.findall(r'\b(\w+)\b', line.lower()))

        # Only process lines containing 'loss of gene'
        log_index = text.find(log)

        if log_index != -1:
            for phrase in phrases:
                phrase_index = text.find(phrase)

                if phrase_index != -1:
                    if log_index < phrase_index:
                        start, end = (log_index + len(log), phrase_index)
                    else:
                        start, end = (phrase_index + len(phrase), log_index)

                    if len(text[start:end].split()) <= 5:
                        print("line {} matched - {}".format(number, phrase))
                        break

This would give you the following kind of output:

line 1 matched - aquaporin protein
line 5 matched - another protein

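Note that, because of the break, this reports only the first matching phrase on each line. Also, text.find() only considers the first occurrence of each phrase in a line.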
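If you need every matching phrase on a line rather than just the first, one option is to drop the break and collect the results. A minimal sketch, assuming a hypothetical function name matching_phrases and an in-memory list of lines in place of the file (an open file object can be passed the same way):

```python
import re

def matching_phrases(lines, anchor, phrases, max_gap=5):
    """Return (line number, phrase) for every phrase close enough to anchor."""
    results = []
    for number, line in enumerate(lines, start=1):
        # Sanitise the line
        text = ' '.join(re.findall(r'\w+', line.lower()))
        anchor_index = text.find(anchor)
        if anchor_index == -1:
            continue
        for phrase in phrases:
            phrase_index = text.find(phrase)
            if phrase_index == -1:
                continue
            if anchor_index < phrase_index:
                start, end = anchor_index + len(anchor), phrase_index
            else:
                start, end = phrase_index + len(phrase), anchor_index
            # No break here, so later phrases on the same line are checked too
            if len(text[start:end].split()) <= max_gap:
                results.append((number, phrase))
    return results

sample = [
    "Loss of gene affects aquaporin protein and another protein.",
    "Nothing relevant here.",
]
print(matching_phrases(sample, 'loss of gene', ['aquaporin protein', 'another protein']))
# [(1, 'aquaporin protein'), (1, 'another protein')]
```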