I'm trying to scrape the results of NHL games using BeautifulSoup, but I'm having trouble getting the dates the games were played and the results in the right order. The dates are in <h3> tags and the results are in spans with the class "field-content". I can currently find both sets of values and store them in separate variables, but I would like to keep the order in which they appear on the original page and store the data in a single variable.
import bs4 as bs
import urllib.request
sauce = urllib.request.urlopen("https://www.jatkoaika.com/nhl/ottelut").read()
soup = bs.BeautifulSoup(sauce, features="html.parser")
dates = str(soup.find_all("h3"))
dates = dates.replace("<h3>", "").replace("</h3>", "")
games = str(soup.find_all("span", {"class": "field-content"}))
games = games.replace('<span class="field-content">', "").replace("</span>", "")
The difficulty in parsing this site is that there is no hierarchy between the date headers and the games you want to parse: the games are not nested under their header, and both are children of the same element.
Use the following CSS selector to get the h3 elements and the spans with the field-content class into one list, in the order they appear on the page:
games = soup.select("h3, span.field-content")
The output:
[<h3>Ma 28.10.2019 runkosarja</h3>,
<span class="field-content">Chicago - Los Angeles</span>,
<span class="field-content">NY Islanders - Philadelphia</span>,
<span class="field-content">NY Rangers - Boston</span>,
<span class="field-content">Ottawa - San Jose</span>,
<span class="field-content">Vegas - Anaheim</span>,
<h3>Ti 29.10.2019 runkosarja</h3>,
...
]
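To check that the combined selector keeps document order, here is a small sketch that runs against an inline HTML snippet instead of the live site. The markup in `sample_html` is an assumption modelled on the output above, not a copy of the real page, and `pairs` is just an illustrative name.

```python
import bs4 as bs

# Hypothetical markup modelled on the output above; the real page may differ.
sample_html = """
<div>
  <h3>Ma 28.10.2019 runkosarja</h3>
  <span class="field-content">Chicago - Los Angeles</span>
  <span class="field-content">NY Rangers - Boston</span>
  <h3>Ti 29.10.2019 runkosarja</h3>
  <span class="field-content">Buffalo - Arizona</span>
</div>
"""

soup = bs.BeautifulSoup(sample_html, features="html.parser")

# A comma-separated selector list matches elements that satisfy either
# selector; the matches are returned in the order they occur in the document.
elements = soup.select("h3, span.field-content")

# Reduce each tag to a (tag name, text) pair so the order is easy to inspect.
pairs = [(e.name, e.get_text(strip=True)) for e in elements]
for name, text in pairs:
    print(name, text)
```

Because headers and games come back interleaved, each game can be attributed to the nearest preceding header.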
Now you can use the following code to group the games by date:
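from collections import defaultdict

dates_with_games = defaultdict(list)
for e in games:
    if e.name == 'h3':
        # A new date header: remember it as the current key
        latestH3 = e.text
    else:
        # A game: file it under the most recent date header
        dates_with_games[latestH3].append(e.text)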
You get a dictionary that looks like this:
{'Ma 28.10.2019 runkosarja':
['Chicago - Los Angeles',
'NY Islanders - Philadelphia',
'NY Rangers - Boston',
'Ottawa - San Jose',
'Vegas - Anaheim'],
'Ti 29.10.2019 runkosarja':
['Buffalo - Arizona',
'Vancouver - Florida'],...
}
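The grouping loop assumes the first selected element is an h3; if a game ever appeared before any header, `latestH3` would be undefined. Below is a slightly more defensive sketch of the same idea, written as a function over (tag name, text) pairs so it can be tried without fetching the page. The function name `group_by_header` and the `"unknown date"` fallback key are illustrative assumptions, not part of the original answer.

```python
from collections import defaultdict


def group_by_header(pairs, header_name="h3", fallback="unknown date"):
    """Group texts under the most recent header text, preserving order."""
    grouped = defaultdict(list)
    current = fallback  # only used if a game appears before any header
    for name, text in pairs:
        if name == header_name:
            current = text
            grouped[current]  # touch the key so dates without games still appear
        else:
            grouped[current].append(text)
    return dict(grouped)


# Sample input shaped like the selector output (tag name, text).
pairs = [
    ("h3", "Ma 28.10.2019 runkosarja"),
    ("span", "Chicago - Los Angeles"),
    ("span", "NY Rangers - Boston"),
    ("h3", "Ti 29.10.2019 runkosarja"),
    ("span", "Buffalo - Arizona"),
]
result = group_by_header(pairs)
print(result)
```

With the real page, the pairs could be built from the selector result, for example `group_by_header((e.name, e.text) for e in games)`. Since Python 3.7 a plain dict keeps insertion order, so the dates stay in the order they appear on the page.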