python, web-scraping, beautifulsoup, lxml, python-webbrowser

Matching a specific piece of text in a title using Beautiful Soup


Basically, I want to find all links whose titles contain certain key terms. The titles of the links I want look like this: abc... (common text), dce... (common text), ... I want to collect every link containing "(common text)" into a list. My code works, and I understand how to find all the links, but I converted each link to a string in order to search for "(common text)". I know that isn't good practice, and I'm not sure how to use Beautiful Soup to find this common element without converting to a string. The complication is that the titles I'm searching for are not all identical. Here's what I have so far:

 from bs4 import BeautifulSoup
 import requests
 import webbrowser

 url = 'website.com'
 http = requests.get(url)

 soup = BeautifulSoup(http.content, "lxml")

 links = soup.find_all('a', limit=4000)
 links_length = len(links)

 string_links = []
 targetlist = []
 
 for a in range(links_length):
     string_links.append(str(links[a]))
     if '(common text)' in string_links[a]:
         targetlist.append(string_links[a])

NOTE: I am looking for the simplest method using Beautiful Soup to accomplish this. Any help will be appreciated.


Solution

  • Without the actual website and the output you expect, it's hard to say exactly what you need, but here is a "cleaner" version of the same check using a list comprehension.

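    from bs4 import BeautifulSoup
    import requests
    import webbrowser
    
    # Placeholder; requests needs a full URL including the scheme (http:// or https://).
    url = 'website.com'
    http = requests.get(url)
    
    soup = BeautifulSoup(http.content, "lxml")
    
    # Collect up to 4000 <a> tags from the page.
    links = soup.find_all('a', limit=4000)
    
    # Keep the markup of every link whose serialized HTML contains the common text.
    targetlist = [str(link) for link in links if "(common text)" in str(link)]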
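
The list comprehension above still converts each link to a string. If the goal is to avoid that, one possible approach is to test the link's visible text (and, if needed, its title attribute) and keep the Tag objects themselves. This is only a sketch under assumptions: the sample HTML is made up to stand in for the real page, has_common_text is a hypothetical helper name, and "html.parser" is used so the snippet runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the real page (assumption).
html = """
<a href="/a">abc (common text)</a>
<a href="/b">dce <b>(common text)</b></a>
<a href="/c">unrelated link</a>
<a href="/d" title="xyz (common text)">icon</a>
"""

# "html.parser" ships with Python; "lxml" would work the same way here.
soup = BeautifulSoup(html, "html.parser")


def has_common_text(link, needle="(common text)"):
    """Return True if needle is in the link's visible text or its title attribute."""
    # get_text() joins the text of all nested tags, so <b>...</b> inside the link is covered.
    # get("title", "") returns an empty string when the attribute is missing.
    return needle in link.get_text() or needle in link.get("title", "")


# Keep the Tag objects, so attributes such as href remain available afterwards.
target_links = [link for link in soup.find_all("a") if has_common_text(link)]
hrefs = [link["href"] for link in target_links]
print(hrefs)
```

Because target_links holds Tag objects rather than strings, each match can still be queried afterwards, for example link["href"] to open it with webbrowser. Note that this checks only the text and the title attribute, whereas the str(link) version also matches the phrase anywhere else in the tag's markup, so the two can return different results on some pages.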