How to scrape next sibling data using BeautifulSoup4 in Python?


I'm trying to pull the 'Basic EPS' row from https://finance.yahoo.com/quote/AAPL/financials using the Python script below:

#!/usr/bin/env python3
import os, pandas as pd
from os import chdir
#Web scraping
from bs4 import BeautifulSoup
import urllib.request as ul

chdir(os.getcwd()+"/Data")
colist="AAPL"

header = {'User-Agent': 'Mozilla/5.0'} #prevents a 403 error
req=ul.Request("https://finance.yahoo.com/quote/"+colist+"/financials", headers=header)
page = ul.urlopen(req)
soup=BeautifulSoup(page, "lxml")

for div_parent in soup.find("div",class_="rowTitle svelte-1xjz32c",title="Basic EPS"):
    print(div_parent.text)
    for div_child in div_parent.find_next_siblings("div",class_="column svelte-1xjz32c"):
        print(div_child.text)

When I run this code, it prints "Basic EPS" about four times and then stops. What I want is the values in the divs adjacent to the "Basic EPS" title div.

The raw HTML is here:

<div class="row lv-0 svelte-1xjz32c"><div class="column sticky svelte-1xjz32c"> <div class="rowTitle svelte-1xjz32c" title="Basic EPS">Basic EPS</div></div> <div class="column svelte-1xjz32c alt">6.46 </div><div class="column svelte-1xjz32c">6.16 </div><div class="column svelte-1xjz32c alt">6.15 </div><div class="column svelte-1xjz32c">5.67 </div><div class="column svelte-1xjz32c alt">3.31 </div></div>  <div class="row lv-0 svelte-1xjz32c"><div class="column sticky svelte-1xjz32c"> <div class="rowTitle svelte-1xjz32c" title="Diluted EPS">Diluted EPS</div></div> <div class="column svelte-1xjz32c alt">6.43 </div><div class="column svelte-1xjz32c">6.13 </div><div class="column svelte-1xjz32c alt">6.11 </div><div class="column svelte-1xjz32c">5.61 </div><div class="column svelte-1xjz32c alt">3.28 </div></div>

I'm not sure whether I should first find every div with class="row lv-0 svelte-1xjz32c", then locate the "Basic EPS" title inside it, and only then drill into the values. Any ideas or pointers? Thanks.


Solution

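  • You are close, but the title <div> is itself nested inside another <div> (class "column sticky"), so the value columns are siblings of that wrapper rather than of the title. Add .parent before looking for siblings. Also note that soup.find() returns a single tag, so there is no need to loop over its result: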

    div_parent = soup.find("div",title="Basic EPS").parent
    print(div_parent.get_text(strip=True))
    
    for div_child in div_parent.find_next_siblings("div"):
        print(div_child.get_text(strip=True))
    

    Or go up to the parent's parent, which is the row <div>, and call stripped_strings:

    list(soup.find("div",title="Basic EPS").parent.parent.stripped_strings)
    

    to get the row as a list:

    ['Basic EPS', '6.46', '6.16', '6.15', '5.67', '3.31']
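
The same idea can be wrapped in a small helper so it works for any row title. This is only a sketch run against the HTML snippet quoted in the question, not against the live page: row_values is a hypothetical helper name, the parser is switched to the built-in "html.parser" so nothing beyond bs4 is needed, and it assumes every row keeps the "row" class and the title attribute shown above (the generated svelte-... class names may well change, so they are deliberately not used).

```python
from bs4 import BeautifulSoup

# Copy of the two rows quoted in the question, reflowed for readability.
HTML = """
<div class="row lv-0 svelte-1xjz32c">
  <div class="column sticky svelte-1xjz32c">
    <div class="rowTitle svelte-1xjz32c" title="Basic EPS">Basic EPS</div>
  </div>
  <div class="column svelte-1xjz32c alt">6.46 </div>
  <div class="column svelte-1xjz32c">6.16 </div>
  <div class="column svelte-1xjz32c alt">6.15 </div>
  <div class="column svelte-1xjz32c">5.67 </div>
  <div class="column svelte-1xjz32c alt">3.31 </div>
</div>
<div class="row lv-0 svelte-1xjz32c">
  <div class="column sticky svelte-1xjz32c">
    <div class="rowTitle svelte-1xjz32c" title="Diluted EPS">Diluted EPS</div>
  </div>
  <div class="column svelte-1xjz32c alt">6.43 </div>
  <div class="column svelte-1xjz32c">6.13 </div>
  <div class="column svelte-1xjz32c alt">6.11 </div>
  <div class="column svelte-1xjz32c">5.61 </div>
  <div class="column svelte-1xjz32c alt">3.28 </div>
</div>
"""


def row_values(soup, title):
    """Return the cell texts of the row whose title attribute equals `title`.

    Hypothetical helper: returns an empty list when no such row exists.
    """
    title_div = soup.find("div", title=title)
    if title_div is None:
        return []
    # Climb to the enclosing row instead of counting .parent hops.
    row = title_div.find_parent("div", class_="row")
    if row is None:
        return []
    # The first stripped string is the title itself; the rest are the values.
    return list(row.stripped_strings)[1:]


soup = BeautifulSoup(HTML, "html.parser")
print(row_values(soup, "Basic EPS"))    # ['6.46', '6.16', '6.15', '5.67', '3.31']
print(row_values(soup, "Diluted EPS"))  # ['6.43', '6.13', '6.11', '5.61', '3.28']
```

Using find_parent with class_="row" avoids hard-coding how deeply the title is nested, which helps if the wrapper structure shifts slightly. The values come back as strings; converting them with float() is straightforward for this snippet, but other rows on the real page may contain placeholders or thousands separators, so that step is left out here.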