This is a web-scraping problem I have run into and don't know how to fix.
I want to call the async function scrape_season, but I cannot call it at the top level of my main file; it gives me the error:
SyntaxError: 'await' allowed only within async function
import os
from bs4 import BeautifulSoup
from playwright.async_api import async_playwright, TimeoutError as PlaywrightTimeout
import time

SEASONS = list(range(2016, 2023))
DATA_DIR = 'data'
STANDINGS_DIR = os.path.join(DATA_DIR, 'standings')
SCORES_DIR = os.path.join(DATA_DIR, 'scores')

async def get_html(url, selector, sleep=5, retries=3):
    html = None
    for i in range(1, retries + 1):
        time.sleep(sleep * i)
        try:
            async with async_playwright() as p:
                browser = await p.firefox.launch()
                page = await browser.new_page()
                await page.goto(url)
                print(await page.title())
                html = await page.inner_html(selector)
        except PlaywrightTimeout:
            print(f'Timeout error on {url}')
            continue
        else:
            break
    return html
async def scrape_season(season):
    url = f'https://www.basketball-reference.com/leagues/NBA_{season}_games.html'
    html = await get_html(url, '#content .filter')
    soup = BeautifulSoup(html, 'html.parser')
    links = soup.find_all('a')
    href = [l['href'] for l in links]
    standings_pages = [f"https://basketball-reference.com{l}" for l in href]
    for url in standings_pages:
        save_path = os.path.join(STANDINGS_DIR, url.split("/")[-1])
        if os.path.exists(save_path):
            continue
        html = await get_html(url, '#all_schedule')
        with open(save_path, 'w+') as f:
            f.write(html)

for season in SEASONS:
    await scrape_season(season)
The problem with this code is that it tries to await at the top level of the module, which is not allowed: await is only valid inside an async function.
Async/await is effectively just a library you could have written yourself. It does not do any magic with the interpreter.
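To see that there is no interpreter magic involved: calling an async def function merely builds a coroutine object, and nothing runs until an event loop drives it. A minimal sketch (greet is a hypothetical coroutine, not from your code):

```python
import asyncio

async def greet():
    return "hello"

coro = greet()                # no code has run yet; this is just an object
print(type(coro).__name__)    # coroutine
print(asyncio.run(coro))      # the event loop drives it to completion: hello
```

This is exactly why your top-level await fails: outside an async function there is no event loop context to suspend into.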
To answer your question, though: replacing the for loop at the end of your code with the following should do the trick. Please read up on how asyncio works to understand why your code did not work and why this version (I hope; I did not test it) does: https://docs.python.org/3/library/asyncio.html
import asyncio

async def main():
    seasons = [scrape_season(season) for season in SEASONS]
    await asyncio.gather(*seasons)

asyncio.run(main())
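One detail to watch with asyncio.gather: it expects its awaitables as separate positional arguments, so a list of coroutines must be unpacked with *. A small sketch using a stand-in coroutine (scrape_season_stub is hypothetical, replacing the real scraper so the pattern runs without network access):

```python
import asyncio

async def scrape_season_stub(season):
    # Hypothetical stand-in for scrape_season: no network, just echoes its input.
    await asyncio.sleep(0)
    return season

async def main():
    tasks = [scrape_season_stub(s) for s in range(2016, 2023)]
    # Note the *: gather(tasks) would raise TypeError, since a list is not awaitable.
    return await asyncio.gather(*tasks)

print(asyncio.run(main()))    # [2016, 2017, 2018, 2019, 2020, 2021, 2022]
```

gather also preserves argument order in its result list, so each season's result lines up with its input.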