I'm trying to scrape some data from a website where I need to be logged in to see the actual content. It all works fine, but it takes about 5 seconds per request, which is way too slow for my needs (I have more than 5000 URLs to scrape). It seems there are faster approaches, such as the asyncio and aiohttp modules, but none of the examples I found on the web show how to log in to a site first and then use these tools.
So I basically need an easy-to-follow example of how to do such a thing.
I tried to rebuild this example: https://realpython.com/python-concurrency/#what-is-concurrency with my code, which did not work. I also tried AsyncHTMLSession() from requests_html, which returned something but did not seem to remember the login.
This is my code so far:
import requests
from bs4 import BeautifulSoup

payload = {
    "name": "username",
    "password": "example_pass",
    "destination": "MAS_Management_UserConsole",
    "loginType": ""
}

links = [several urls]

### stuff with requests
with requests.Session() as c:
    c.get('http://boldsystems.org/')
    c.post('http://boldsystems.org/index.php/Login', data=payload)

    def return_id(link):
        page = c.get(link).content
        soup = BeautifulSoup(page, 'html.parser')
        return soup.find(id='processidLC').text

    for link in links:
        print(return_id(link))
It looks like you're already using requests, so you can try requests-async. The example below should help with the "in reasonable time" part of your question; just adjust the parse_html function to search for your HTML tag. By default it runs 50 requests in parallel (MAX_REQUESTS) so as not to exhaust resources on your system (file descriptors, etc.).
Example:
import asyncio
import time

import requests_async as requests
from bs4 import BeautifulSoup
from requests_async.exceptions import HTTPError, RequestException, Timeout

MAX_REQUESTS = 50
URLS = [
    'http://envato.com',
    'http://amazon.co.uk',
    'http://amazon.com',
    'http://facebook.com',
    'http://google.com',
    'http://google.fr',
    'http://google.es',
    'http://google.co.uk',
    'http://internet.org',
    'http://gmail.com',
    'http://stackoverflow.com',
    'http://github.com',
    'http://heroku.com',
    'http://djangoproject.com',
    'http://rubyonrails.org',
    'http://basecamp.com',
    'http://trello.com',
    'http://yiiframework.com',
    'http://shopify.com',
    'http://airbnb.com',
    'http://instagram.com',
    'http://snapchat.com',
    'http://youtube.com',
    'http://baidu.com',
    'http://yahoo.com',
    'http://live.com',
    'http://linkedin.com',
    'http://yandex.ru',
    'http://netflix.com',
    'http://wordpress.com',
    'http://bing.com',
]


class HTTPRequestFailed(Exception):
    pass


async def fetch(url, timeout=5):
    async with requests.Session() as session:
        try:
            resp = await session.get(url, timeout=timeout)
            resp.raise_for_status()
        except HTTPError:
            raise HTTPRequestFailed(f'Skipped: {resp.url} ({resp.status_code})')
        except Timeout:
            raise HTTPRequestFailed(f'Timeout: {url}')
        except RequestException as e:
            raise HTTPRequestFailed(e)
        return resp


async def parse_html(html):
    bs = BeautifulSoup(html, 'html.parser')
    title = bs.title.text.strip() if bs.title else ''
    return title if title else 'Unknown'


async def run(sem, url):
    # The semaphore caps how many requests run at the same time.
    async with sem:
        start_t = time.time()
        resp = await fetch(url)
        title = await parse_html(resp.text)
        end_t = time.time()
        elapsed_t = end_t - start_t
        r_time = resp.elapsed.total_seconds()
        print(f'{url}, title: "{title}" (total: {elapsed_t:.2f}s, request: {r_time:.2f}s)')
        return resp


async def main():
    sem = asyncio.Semaphore(MAX_REQUESTS)
    tasks = [asyncio.create_task(run(sem, url)) for url in URLS]
    for f in asyncio.as_completed(tasks):
        try:
            result = await f
        except Exception as e:
            print(e)


if __name__ == '__main__':
    asyncio.run(main())
Output:
# time python req.py
http://google.com, title: "Google" (total: 0.69s, request: 0.58s)
http://yandex.ru, title: "Яндекс" (total: 2.01s, request: 1.65s)
http://github.com, title: "The world’s leading software development platform · GitHub" (total: 2.12s, request: 1.90s)
Timeout: http://yahoo.com
...
real 0m6.868s
user 0m3.723s
sys 0m0.524s
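To adapt this to your case, the important change is to log in once and share a single session across all requests; in the example above, fetch opens a fresh session per URL, and a fresh session carries no cookies, so it wouldn't stay logged in. Below is a minimal, untested sketch that assumes boldsystems.org keeps you logged in via session cookies; the payload and the processidLC id are taken from your code:
import asyncio
import requests_async as requests
from bs4 import BeautifulSoup

MAX_REQUESTS = 50
links = []  # fill in your >5000 urls

payload = {
    "name": "username",
    "password": "example_pass",
    "destination": "MAS_Management_UserConsole",
    "loginType": ""
}


async def return_id(sem, session, link):
    async with sem:  # stay under MAX_REQUESTS parallel requests
        resp = await session.get(link)
        soup = BeautifulSoup(resp.text, 'html.parser')
        tag = soup.find(id='processidLC')
        return tag.text if tag else None


async def main():
    sem = asyncio.Semaphore(MAX_REQUESTS)
    async with requests.Session() as session:
        # Log in once; the session keeps the cookies for every
        # request made through it afterwards.
        await session.get('http://boldsystems.org/')
        await session.post('http://boldsystems.org/index.php/Login', data=payload)
        tasks = [asyncio.create_task(return_id(sem, session, link)) for link in links]
        for task in asyncio.as_completed(tasks):
            try:
                print(await task)
            except Exception as e:
                print(e)


if __name__ == '__main__':
    asyncio.run(main())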
This may still not be enough, though: the HTML tag you're looking for (or the entire web page) could be generated by JavaScript, in which case you'll need tools like requests-html, which uses a headless browser to read content rendered by JavaScript.
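For illustration, a minimal sketch of that approach; the selector is carried over from your code, and note that the first call to render() downloads a headless Chromium:
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('http://boldsystems.org/')
r.html.render()  # executes the page's JavaScript in headless Chromium
element = r.html.find('#processidLC', first=True)  # CSS selector
print(element.text if element else 'not found')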
It's also possible that your login form is protected by CSRF. Here's an example of logging in to the Django admin backend:
>>> import requests
>>> s = requests.Session()
>>> get = s.get('http://localhost/admin/')
>>> csrftoken = get.cookies.get('csrftoken')
>>> payload = {'username': 'admin', 'password': 'abc123', 'csrfmiddlewaretoken': csrftoken, 'next': '/admin/'}
>>> post = s.post('http://localhost/admin/login/?next=/admin/', data=payload)
>>> post.status_code
200
We use the session to perform a GET request first, to read the token from the csrftoken cookie, and then we log in, sending the two hidden form fields along with the credentials:
<form action="/admin/login/?next=/admin/" method="post" id="login-form">
  <input type="hidden" name="csrfmiddlewaretoken" value="uqX4NIOkQRFkvQJ63oBr3oihhHwIEoCS9350fVRsQWyCrRub5llEqu1iMxIDWEem">
  <div class="form-row">
    <label class="required" for="id_username">Username:</label>
    <input type="text" name="username" autofocus="" required="" id="id_username">
  </div>
  <div class="form-row">
    <label class="required" for="id_password">Password:</label>
    <input type="password" name="password" required="" id="id_password">
    <input type="hidden" name="next" value="/admin/">
  </div>
  <div class="submit-row">
    <label>&nbsp;</label>
    <input type="submit" value="Log in">
  </div>
</form>
Note: the examples above require Python 3.7+ (they use asyncio.run and asyncio.create_task).