Tags: python, web-scraping, beautifulsoup

How to scrape a website for multiple values that need to be ordered


I'm trying to scrape the results of NHL games using BeautifulSoup, but I'm having trouble getting the dates on which the games were played and the results in order. The dates of the games are in <h3> tags and the results are in <span> elements with the class "field-content". Currently I can find both sets of values and store them in separate variables, but I would like to keep the order in which they appear on the original website and store the data in a single variable.

import bs4 as bs
import urllib.request

sauce = urllib.request.urlopen("https://www.jatkoaika.com/nhl/ottelut").read()

soup = bs.BeautifulSoup(sauce, features="html.parser")

dates = str(soup.find_all("h3"))
dates = dates.replace("<h3>", "").replace("</h3>", "")

games = str(soup.find_all("span", {"class": "field-content"}))
games = games.replace('<span class="field-content">', "").replace("</span>", "")

Solution

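  • The difficulty in parsing this site is that there is no hierarchy between the header elements and the games you want to parse: the h3 headers and the game spans are all contained in the same element, so a game is tied to its date only by its position in the document.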

    Use the following CSS selector to get the h3 elements and the spans with the field-content class in a single list, in the order they appear in the document:

    games = soup.select("h3, span.field-content")
    

    The output:

    [<h3>Ma 28.10.2019 runkosarja</h3>,
     <span class="field-content">Chicago - Los Angeles</span>,
     <span class="field-content">NY Islanders - Philadelphia</span>,
     <span class="field-content">NY Rangers - Boston</span>,
     <span class="field-content">Ottawa - San Jose</span>,
     <span class="field-content">Vegas - Anaheim</span>,
     <h3>Ti 29.10.2019 runkosarja</h3>,
     ...
    ]
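
To see why a single selector list keeps the order, here is a minimal sketch on an inline HTML fragment. The fragment is a made-up stand-in for the real page (its structure is an assumption based on the output above), and the sketch assumes the bs4 package is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page: date headers and games are siblings
# inside one container, with an unrelated span mixed in.
html = """
<div>
  <h3>Ma 28.10.2019 runkosarja</h3>
  <span class="field-content">Chicago - Los Angeles</span>
  <span class="other">not a game</span>
  <span class="field-content">NY Islanders - Philadelphia</span>
  <h3>Ti 29.10.2019 runkosarja</h3>
  <span class="field-content">Buffalo - Arizona</span>
</div>
"""

soup = BeautifulSoup(html, features="html.parser")

# A comma-separated selector list matches elements that satisfy either
# selector. Matches come back in document order, so every game follows
# the date header it belongs to.
games = soup.select("h3, span.field-content")

for e in games:
    print(e.name, "|", e.text)
```

Calling find_all("h3") and find_all("span", ...) separately, as in the question, produces two lists that no longer record which games sit under which header.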
    

    Now you can use the following code to group the games under their dates:

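    from collections import defaultdict

    dates_with_games = defaultdict(list)
    latest_h3 = None  # text of the most recent date header
    for e in games:
        if e.name == "h3":
            # A new date header: the games that follow belong to this date
            latest_h3 = e.text
        elif latest_h3 is not None:
            # Skip any span that appears before the first date header
            dates_with_games[latest_h3].append(e.text)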
    

    You get a dictionary that looks like this:

    {'Ma 28.10.2019 runkosarja':
      ['Chicago - Los Angeles',
       'NY Islanders - Philadelphia',
       'NY Rangers - Boston',
       'Ottawa - San Jose',
       'Vegas - Anaheim'],
     'Ti 29.10.2019 runkosarja':
      ['Buffalo - Arizona',
       'Vancouver - Florida'],
     ...
    }
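
If a flat, ordered structure is more convenient than a dictionary, the grouped result can be flattened into a list of (date, game) pairs. This is a minimal sketch, assuming Python 3.7 or later, where dictionaries keep insertion order. The name ordered_games is introduced here for illustration, and the literal dictionary stands in for the result of the grouping loop.

```python
# Sample data copied from the dictionary shown above (only the entries
# that are visible there).
dates_with_games = {
    "Ma 28.10.2019 runkosarja": [
        "Chicago - Los Angeles",
        "NY Islanders - Philadelphia",
        "NY Rangers - Boston",
        "Ottawa - San Jose",
        "Vegas - Anaheim",
    ],
    "Ti 29.10.2019 runkosarja": [
        "Buffalo - Arizona",
        "Vancouver - Florida",
    ],
}

# Dictionaries (including defaultdict) preserve insertion order, so walking
# the items reproduces the order in which dates and games were scraped.
ordered_games = [
    (date, game)
    for date, games_on_date in dates_with_games.items()
    for game in games_on_date
]

for date, game in ordered_games:
    print(date, "|", game)
```

Each pair carries its own date, so the list can be sorted, filtered, or written to a file without losing the link between a game and the day it was played.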