Tags: python, web, beautifulsoup, screen-scraping

Separating the information in the output of my scraping code (BeautifulSoup + Python)


The profile I am scraping is https://lawyers.justia.com/lawyer/robin-d-gross-39828 . The Education and Professional Associations entries are all printed together under the same "Education:" label. How can I separate the two?

# soup is assumed to be a BeautifulSoup object built from the profile page
for item in soup.find_all("dl", {"class": "description-list list-with-badges"}):
    y = item.find_all("span", attrs={"itemprop": "name"})
    if y:
        print("Education:", item.get_text(strip=True, separator="|").split("|"))

Output is:

Education: ['Santa Clara University School of Law', 'J.D. ', '  Law', '1998', 'Honors:', 'Awarded "Certificate in High Technology Law"', 'Activities:', 'Editor, Santa Clara Computer & High Technology Law Journal;  Editor-in-Chief, The Advocate, Santa Clara University Law School Newspaper.']
Education: ['Michigan State University, James Madison College', 'B.A. ', '  Political Philosophy', '1995', 'Honors:', 'Overseas Study Program in Caribbean and South America, Summer Semester 1994Vice-President, MSU Adventure Club']
Education: ['Michigan State University, James Madison College', 'B.A. ', '  International Relations', '1995']
Education: ['California State Bar', '# 200701', 'Member', 'Current']
Education: ['California Bar Association', 'Member', 'Current']
Education: ['San Francisco Bar Association', 'Member', 'Current']
Education: ['American Bar Association', 'Member', 'Current']
Education: ['Internet Corporation for Assigned Names and Numbers (ICANN) - Noncommercial Stakeholders Group', 'Executive Committee', '2010', '- Current']
Education: ['Executive Committee of FreeMuse', 'Member', '2009', '-', '2016']
Education: ['Public Interest Registry - Advisory Council', 'Member', '2012', '-', '2014']

Solution

  • You're using "class": "description-list list-with-badges" to select your items. If you look at the page's HTML, you'll see that the entries under both Education and Professional Associations carry these classes, so the loop matches all of them and labels every one "Education:".

    To pick them up separately, you could use the itemtype attribute: its value is http://schema.org/CollegeOrUniversity for Education entries and http://schema.org/Organization for Professional Associations entries.
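
The itemtype approach could be sketched roughly as follows. This is a minimal illustration, not the exact code for the live page: SAMPLE_HTML is a hypothetical, simplified stand-in for the profile markup, and the placement of the itemtype attribute (on the dl itself or on an element inside it) is an assumption, so the helper checks both. The names LABELS, label_for and grouped_entries are made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the profile page's markup.
# The real page may put the itemtype attribute on a different element.
SAMPLE_HTML = """
<dl class="description-list list-with-badges">
  <dt itemscope itemtype="http://schema.org/CollegeOrUniversity">
    <span itemprop="name">Santa Clara University School of Law</span>
  </dt>
  <dd>J.D.</dd>
  <dd>1998</dd>
</dl>
<dl class="description-list list-with-badges">
  <dt itemscope itemtype="http://schema.org/Organization">
    <span itemprop="name">American Bar Association</span>
  </dt>
  <dd>Member</dd>
  <dd>Current</dd>
</dl>
"""

# Map each schema.org itemtype value to the label to print.
LABELS = {
    "http://schema.org/CollegeOrUniversity": "Education",
    "http://schema.org/Organization": "Professional Associations",
}


def label_for(item):
    """Return the label for a <dl>, based on the itemtype it carries or contains."""
    for itemtype, label in LABELS.items():
        if item.get("itemtype") == itemtype or item.find(attrs={"itemtype": itemtype}):
            return label
    return None


def grouped_entries(html):
    """Group the text of each matching <dl> under its section label."""
    soup = BeautifulSoup(html, "html.parser")
    groups = {label: [] for label in LABELS.values()}
    for item in soup.find_all("dl", {"class": "description-list list-with-badges"}):
        label = label_for(item)
        if label:
            groups[label].append(item.get_text(strip=True, separator="|").split("|"))
    return groups


if __name__ == "__main__":
    for label, entries in grouped_entries(SAMPLE_HTML).items():
        for entry in entries:
            print(f"{label}:", entry)
```

To try this against the real profile, pass the downloaded page source to grouped_entries instead of SAMPLE_HTML. If the itemtype attribute turns out to sit somewhere else in the markup, only label_for should need adjusting.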