Tags: python, python-3.x, list, python-requests, python-re

Check a list of words and return found words from page source code with a unique list


I have looked through various other questions, but none seem to fit the bill, so here goes.

I have a list of words

l = ['red','green','yellow','blue','orange'] 

I also have the source code of a web page in another variable, fetched with the requests library.

import requests

url = 'https://google.com'
response = requests.get(url)
source = response.text  # decoded str; response.content is bytes, which a str regex pattern cannot search

I then created a substring lookup function like so

import re

def find_all_substrings(string, sub):
    # Return the start index of every occurrence of sub in string.
    return [match.start() for match in re.finditer(re.escape(sub), string)]

I now look up the words using the following code, which is where I am stuck.

for word in l:
    substrings = find_all_substrings(source, word)
    new = []
    for pos in substrings:
        ok = False
        if not ok:
            print(word + ";")
            if word not in new:
                new.append(word)
                print(new)
            page['words'] = new

My ideal output looks like the following

Found words - ['red', 'green']


Solution

  • If all you want is a list of words that are present, you can avoid most of the regex processing and just use

    found_words = [word for word in target_words if word in page_content]
    

    (I've renamed your source -> page_content and l -> target_words.)

    If you need additional information or processing (e.g. regexes or the BeautifulSoup parser) and end up with a list of items that needs deduplicating, you can just run it through a set() call. If you need a list instead of a set, cast it back with list(); if you want a deterministic (alphabetical) order for found_words, use sorted(). Any of the following should work fine:

    found_words = set(possibly_redundant_list_of_found_words)
    found_words = list(set(possibly_redundant_list_of_found_words))
    found_words = sorted(set(possibly_redundant_list_of_found_words))
    

    If you've got some sort of data structure you're parsing (because BeautifulSoup & regex can provide supplemental information about position & context, and you might care about those), then just define a custom function extract_word_from_struct() that extracts the word from that structure, call it in a comprehension, and deduplicate the result with set():

    possibly_redundant_list_of_found_words = [extract_word_from_struct(struct) for struct in possibly_redundant_list_of_findings]
    found_words = set(word for word in possibly_redundant_list_of_found_words if word in target_words)
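
As a rough sketch of that last approach, assuming the findings are re.Match objects: the sample page_content string below is made up (it stands in for response.text), extract_word_from_struct() is the hypothetical helper named above, and dict.fromkeys() is swapped in for set() so the words keep the order in which they first appear on the page. The \b word boundaries are an extra assumption about what counts as a word; without them, 'red' would also be found inside 'Credit'.

```python
import re

# Made-up sample data standing in for response.text and the asker's word list.
page_content = "<p>A red car, a green tree and another red door. Credit: nobody.</p>"
target_words = ['red', 'green', 'yellow', 'blue', 'orange']

# One pattern that matches any target word as a whole word only.
pattern = re.compile(
    r"\b(?:" + "|".join(re.escape(word) for word in target_words) + r")\b"
)

# Each finding is a re.Match object, which carries the position as well as the text.
possibly_redundant_list_of_findings = list(pattern.finditer(page_content))


def extract_word_from_struct(struct):
    """Hypothetical helper: return the matched word from a re.Match object."""
    return struct.group(0)


possibly_redundant_list_of_found_words = [
    extract_word_from_struct(struct) for struct in possibly_redundant_list_of_findings
]

# dict.fromkeys() keeps the first occurrence of each word, in page order.
found_words = list(dict.fromkeys(possibly_redundant_list_of_found_words))

print("Found words -", found_words)
```

For the sample string this prints Found words - ['red', 'green']. If you also need to know where each word occurs, struct.start() on the same Match objects gives the position.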