Find a tag using text it contains using BeautifulSoup


I am trying to scrape parts of this page: https://markets.businessinsider.com/stocks/bp-stock using BeautifulSoup, locating each table by the text in its h2 heading.

When I do:

data_table = soup.find('h2', text=re.compile('RELATED STOCKS')).find_parent('div').find('table')

it correctly gets the table I am after.

When I try to get the "Analyst Opinion" table using a similar line, it returns None:

data_table = soup.find('h2', text=re.compile('ANALYST OPINIONS')).find_parent('div').find('table')

I am guessing that there might be some special characters in the HTML that prevent re from working as expected. I tried this too:

data_table = soup.find('h2', text=re.compile('.*?STOCK.*?INFORMATION.*?', re.DOTALL))

without success.

I would like to get the table whose heading contains the text "Analyst Opinion" by matching on that text, without having to find all tables first.

Any idea will be highly appreciated. Best


Solution

  • You can use a CSS selector to locate the <table>:

    import requests
    from bs4 import BeautifulSoup
    
    url = 'https://markets.businessinsider.com/stocks/bp-stock'
    
    soup = BeautifulSoup(requests.get(url).text, 'lxml')
    
    # :-soup-contains is the current spelling in Soup Sieve; older releases only know :contains
    table = soup.select_one('div:has(> h2:-soup-contains("Analyst Opinions")) table')
    
    for tr in table.select('tr'):
        print(tr.get_text(strip=True, separator=' '))
    

    Prints:

    2/26/2018 BP Outperform RBC Capital Markets
    9/22/2017 BP Outperform BMO Capital Markets
    

    More about CSS selectors here.
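
To try the selector without hitting the network, here is a minimal sketch on made-up markup. The HTML below is hypothetical and only loosely modelled on the page layout described in the question; it assumes a Soup Sieve version that supports :-soup-contains.

```python
from bs4 import BeautifulSoup

# Hypothetical markup (an assumption, not copied from the real page).
html = """
<div class="box">
  <h2>Related Stocks</h2>
  <table><tr><td>Shell</td><td>Oil</td></tr></table>
</div>
<div class="box">
  <h2>Analyst Opinions</h2>
  <table><tr><td>2/26/2018</td><td>BP Outperform</td></tr></table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Pick the <div> whose direct <h2> child contains the text, then its <table>.
table = soup.select_one('div:has(> h2:-soup-contains("Analyst Opinions")) table')

rows = [tr.get_text(strip=True, separator=" ") for tr in table.select("tr")]
print(rows)
```

Note that the text match here is case-sensitive, so "ANALYST OPINIONS" would not match this heading.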


    EDIT: For a case-insensitive match, you can use the bs4 API with regular expressions (note the flags=re.I). Apart from ignoring case, this is equivalent to the .select_one() call above:

    import re
    import requests
    from bs4 import BeautifulSoup
    
    url = 'https://markets.businessinsider.com/stocks/bp-stock'
    
    soup = BeautifulSoup(requests.get(url).text, 'lxml')
    
    h2 = soup.find(lambda t: t.name == 'h2' and re.search('analyst opinions', t.text, flags=re.I))
    table = h2.find_parent('div').find('table')
    
    for tr in table.select('tr'):
        print(tr.get_text(strip=True, separator=' '))
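
As for why the original find(..., text=re.compile(...)) can come back empty: one possible cause (an assumption here, since the page's markup is not shown) is that the heading contains a nested tag. The text/string filter only looks at a tag's .string, which is None when the tag has more than one child. A small sketch on hypothetical markup, using the newer string= keyword and a made-up helper named heading_matches:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: the second heading wraps part of its text in <a>.
html = """
<div>
  <h2>Related Stocks</h2>
  <table><tr><td>plain heading</td></tr></table>
</div>
<div>
  <h2><a href="#">Analyst</a> Opinions</h2>
  <table><tr><td>nested heading</td></tr></table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Works: this <h2> has a single text child, so its .string is set.
plain = soup.find("h2", string=re.compile("related stocks", re.I))

# Finds nothing: this <h2> has two children, so its .string is None.
nested = soup.find("h2", string=re.compile("analyst opinions", re.I))


def heading_matches(tag):
    """Match an <h2> on its full text, including text inside nested tags."""
    return tag.name == "h2" and re.search(
        "analyst opinions", tag.get_text(), flags=re.I
    ) is not None


h2 = soup.find(heading_matches)
table = h2.find_parent("div").find("table")

print(plain is not None, nested is None, table.get_text(strip=True))
```

Matching on get_text() (or .text, as in the lambda above) sidesteps the problem because it joins the text of all descendants before the regex is applied.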