python web-scraping beautifulsoup

Is there a better way to structure this scrape?


I am still learning how to scrape web pages with BeautifulSoup and Python. I came up with the code below to grab the professional experience from this page: https://lawyers.justia.com/lawyer/ali-shahrestani-esq-198352

for item in soup.findAll("dl",attrs={"class":"description-list list-with-badges"}):
    x=item.findAll("strong")
    x=remove_tags(str(x))
    print(x)

Output:

[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[Attorney]
[]
[]
[]
[]
[]
[]
[]
[]
[]

I am also trying to get the information listed under "Attorney", but I am struggling.


Solution

  • You can skip the empty results by checking if x: and then do whatever you need with the matching item, for example extract its text

    for item in soup.find_all("dl", {"class": "description-list list-with-badges"}):
        x = item.find_all("strong")
        if x:
            print('strong:', x[0].get_text(strip=True))
            print('text:', item.get_text(strip=True, separator='|'))
            print('list:', item.get_text(strip=True, separator='|').split('|'))
    

    Result:

    strong: Attorney
    text: Attorney|Ali Shahrestani, Esq.|2007|- Current
    list: ['Attorney', 'Ali Shahrestani, Esq.', '2007', '- Current']
    

    Alternatively, you can search for a more specific, unique attribute, e.g.

    <strong itemprop='jobTitle'>
    

    and then navigate from that element using another attribute, e.g. parent

    data = soup.find('strong', {'itemprop': 'jobTitle'}).parent.parent
    print('text:', data.get_text(strip=True, separator='|'))
    print('list:', data.get_text(strip=True, separator='|').split('|'))
    

    Result:

    text: Attorney|Ali Shahrestani, Esq.|2007|- Current
    list: ['Attorney', 'Ali Shahrestani, Esq.', '2007', '- Current']
    

    Full example

    import requests
    from bs4 import BeautifulSoup as BS
    
    url = 'https://lawyers.justia.com/lawyer/ali-shahrestani-esq-198352'
    r = requests.get(url)
    
    soup = BS(r.text, 'html.parser')
    
    # Approach 1: loop over every matching <dl> and keep only those that contain a <strong>
    for item in soup.find_all("dl", {"class": "description-list list-with-badges"}):
        x = item.find_all("strong")
        if x:
            print('strong:', x[0].get_text(strip=True))
            print('text:', item.get_text(strip=True, separator='|'))
            print('list:', item.get_text(strip=True, separator='|').split('|'))
    
    print('---')
    
    # Approach 2: find the job title directly, then go up two levels to the enclosing block
    item = soup.find('strong', {'itemprop': 'jobTitle'}).parent.parent
    print('text:', item.get_text(strip=True, separator='|'))
    print('list:', item.get_text(strip=True, separator='|').split('|'))
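
A slightly more defensive variant of the two approaches above, as a rough sketch rather than a definitive version. It assumes the job title sits in a `<strong itemprop="jobTitle">` somewhere inside the `<dl>`, which is what the output above suggests. The `SAMPLE_HTML` snippet and the helper name `experience_entries` are made up for illustration, so the real page may be structured differently. Two things change: `find_parent("dl")` replaces `.parent.parent`, so the code does not depend on the exact nesting depth, and `stripped_strings` replaces joining with `'|'` and splitting again, which would split in the wrong place if the text itself contained a `|`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on the output shown above.
# The real page may be structured differently.
SAMPLE_HTML = """
<dl class="description-list list-with-badges">
  <dt>Languages</dt>
  <dd>English</dd>
</dl>
<dl class="description-list list-with-badges">
  <dt><strong itemprop="jobTitle">Attorney</strong></dt>
  <dd>Ali Shahrestani, Esq.</dd>
  <dd><time>2007</time> - Current</dd>
</dl>
"""


def experience_entries(soup):
    """Return one list of text fragments per <dl> that contains a job title."""
    entries = []
    for title in soup.find_all("strong", {"itemprop": "jobTitle"}):
        # Walk up to the nearest enclosing <dl>, however deep the nesting is.
        block = title.find_parent("dl")
        if block is None:
            # Job title found outside any <dl>: skip it instead of crashing.
            continue
        # stripped_strings yields each text node with surrounding whitespace
        # removed and skips whitespace-only nodes.
        entries.append(list(block.stripped_strings))
    return entries


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
for entry in experience_entries(soup):
    print(entry)
# ['Attorney', 'Ali Shahrestani, Esq.', '2007', '- Current']
```

If no job title is found at all, the function returns an empty list, whereas `soup.find(...).parent.parent` raises an `AttributeError` because `find` returns `None`.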