Tags: python, asynchronous, python-requests, screen-scraping, python-requests-html

python - Scrape many URLs with login in reasonable time


I'm trying to scrape some data from a website where I need to be logged in to see the actual content. It all works fine but takes about 5 seconds per request, which is way too slow for my needs (I have more than 5000 URLs to scrape). It seems there are faster approaches, such as the asyncio and aiohttp modules, but none of the examples I found on the web showed how to log in to a site first and then use these tools.

So I basically need an easy-to-follow example of how to do such a thing.

I tried to rebuild this example (https://realpython.com/python-concurrency/#what-is-concurrency) with my code, which did not work. I also tried AsyncHTMLSession() from requests_html, which returned something but did not seem to remember the login.

This is my code so far:

import requests
from bs4 import BeautifulSoup

payload = {
"name" : "username",
"password" : "example_pass",
"destination" : "MAS_Management_UserConsole",
"loginType" : ""
}

links = [several urls]

### log in once, then reuse the same session for every request
with requests.Session() as c:
    c.get('http://boldsystems.org/')
    c.post('http://boldsystems.org/index.php/Login', data=payload)

    def return_id(link):
        page = c.get(link).content
        soup = BeautifulSoup(page, 'html.parser')
        return soup.find(id='processidLC').text

    for link in links:
        print(return_id(link))

Solution

  • It looks like you're already using requests, so you can try requests-async. The example below should help with the "in reasonable time" part of your question; just adjust the parse_html function to search for your HTML tag. By default it runs at most 50 requests in parallel (MAX_REQUESTS) so it doesn't exhaust resources on your system (file descriptors etc.).

    Example:

    import asyncio
    import requests_async as requests
    import time
    
    from bs4 import BeautifulSoup
    from requests_async.exceptions import HTTPError, RequestException, Timeout
    
    
    MAX_REQUESTS = 50
    URLS = [
        'http://envato.com',
        'http://amazon.co.uk',
        'http://amazon.com',
        'http://facebook.com',
        'http://google.com',
        'http://google.fr',
        'http://google.es',
        'http://google.co.uk',
        'http://internet.org',
        'http://gmail.com',
        'http://stackoverflow.com',
        'http://github.com',
        'http://heroku.com',
        'http://djangoproject.com',
        'http://rubyonrails.org',
        'http://basecamp.com',
        'http://trello.com',
        'http://yiiframework.com',
        'http://shopify.com',
        'http://airbnb.com',
        'http://instagram.com',
        'http://snapchat.com',
        'http://youtube.com',
        'http://baidu.com',
        'http://yahoo.com',
        'http://live.com',
        'http://linkedin.com',
        'http://yandex.ru',
        'http://netflix.com',
        'http://wordpress.com',
        'http://bing.com',
    ]
    
    
    # one custom error type is enough; don't shadow the built-in BaseException
    class HTTPRequestFailed(Exception):
        pass
    
    
    async def fetch(url, timeout=5):
        async with requests.Session() as session:
            try:
                resp = await session.get(url, timeout=timeout)
                resp.raise_for_status()
            except HTTPError:
                raise HTTPRequestFailed(f'Skipped: {resp.url} ({resp.status_code})')
            except Timeout:
                raise HTTPRequestFailed(f'Timeout: {url}')
            except RequestException as e:
                raise HTTPRequestFailed(e)
            return resp
    
    
    async def parse_html(html):
        bs = BeautifulSoup(html, 'html.parser')
        # bs.title can be None on pages without a <title> tag
        title = bs.title.text.strip() if bs.title else ''
        return title or 'Unknown'
    
    
    async def run(sem, url):
        async with sem:
            start_t = time.time()
            resp = await fetch(url)
            title = await parse_html(resp.text)
            end_t = time.time()
            elapsed_t = end_t - start_t
            r_time = resp.elapsed.total_seconds()
            print(f'{url}, title: "{title}" (total: {elapsed_t:.2f}s, request: {r_time:.2f}s)')
            return resp
    
    
    async def main():
        sem = asyncio.Semaphore(MAX_REQUESTS)
        tasks = [asyncio.create_task(run(sem, url)) for url in URLS]
        for f in asyncio.as_completed(tasks):
            try:
                result = await f
            except Exception as e:
                print(e)
    
    
    if __name__ == '__main__':
        asyncio.run(main())
    

    Output:

    # time python req.py 
    http://google.com, title: "Google" (total: 0.69s, request: 0.58s)
    http://yandex.ru, title: "Яндекс" (total: 2.01s, request: 1.65s)
    http://github.com, title: "The world’s leading software development platform · GitHub" (total: 2.12s, request: 1.90s)
    Timeout: http://yahoo.com
    ...
    
    real    0m6.868s
    user    0m3.723s
    sys 0m0.524s
    

    Now, this may still not help you with your login issue. The HTML tag you're looking for (or the entire web page) could be generated by JavaScript, in which case you'll need a tool like requests-html, which uses a headless browser to read content rendered by JavaScript.
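    Since the question mentions that AsyncHTMLSession "did not seem to remember the login", here is a minimal sketch of how that approach could look. The site URL and payload are taken from the question; everything else is an assumption. The key point is to create one AsyncHTMLSession, POST the login form on it, and reuse that same session for every later get(), so its cookie jar carries the login across requests.

```python
# Sketch only, assuming requests-html is installed. requests_html is imported
# lazily inside main() because r.html.arender() downloads a Chromium build on
# first use, which is a heavy dependency you may not want at import time.

async def fetch_rendered(asession, url):
    r = await asession.get(url)
    await r.html.arender()   # execute the page's JavaScript headlessly
    return r.html.html       # the rendered HTML, ready for BeautifulSoup


async def main(links, payload):
    from requests_html import AsyncHTMLSession  # lazy import of heavy dependency
    asession = AsyncHTMLSession()
    # log in once; the session's cookie jar keeps the login for later requests
    await asession.get('http://boldsystems.org/')
    await asession.post('http://boldsystems.org/index.php/Login', data=payload)
    # rendering is resource-heavy, so this fetches sequentially; wrap the
    # coroutines with asyncio.gather only if your machine can take it
    return [await fetch_rendered(asession, link) for link in links]
```

    Note that headless rendering is much slower than a plain HTTP request, so only reach for this if the tag you need really is produced by JavaScript.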

    It's also possible that your login form uses CSRF protection. Here is an example that logs in to the Django admin backend:

    >>> import requests
    >>> s = requests.Session()
    >>> get = s.get('http://localhost/admin/')
    >>> csrftoken = get.cookies.get('csrftoken')
    >>> payload = {'username': 'admin', 'password': 'abc123', 'csrfmiddlewaretoken': csrftoken, 'next': '/admin/'}
    >>> post = s.post('http://localhost/admin/login/?next=/admin/', data=payload)
    >>> post.status_code
    200
    

    We use the session to perform a GET request first, to read the token from the csrftoken cookie, and then we log in, sending it along with the two hidden form fields:

    <form action="/admin/login/?next=/admin/" method="post" id="login-form">
      <input type="hidden" name="csrfmiddlewaretoken" value="uqX4NIOkQRFkvQJ63oBr3oihhHwIEoCS9350fVRsQWyCrRub5llEqu1iMxIDWEem">
      <div class="form-row">
        <label class="required" for="id_username">Username:</label>
        <input type="text" name="username" autofocus="" required="" id="id_username">
      </div>
      <div class="form-row">
        <label class="required" for="id_password">Password:</label> <input type="password" name="password" required="" id="id_password">
        <input type="hidden" name="next" value="/admin/">
      </div>
        <div class="submit-row">
        <label>&nbsp;</label>
        <input type="submit" value="Log in">
      </div>
    </form>
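
    Some sites embed the token only as a hidden <input> like the one above, without setting a cookie. In that case it can be scraped from the form HTML instead; a small sketch with BeautifulSoup, assuming the Django field name shown above (adjust the field name for other frameworks):

```python
from bs4 import BeautifulSoup


def extract_csrf_token(html, field='csrfmiddlewaretoken'):
    """Return the value of a hidden CSRF input, or None if it's absent."""
    soup = BeautifulSoup(html, 'html.parser')
    tag = soup.find('input', {'name': field})
    return tag['value'] if tag and tag.has_attr('value') else None
```

    Usage would then be `token = extract_csrf_token(s.get('http://localhost/admin/').text)` before building the login payload.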
    

    Note: the examples require Python 3.7+ (asyncio.run() and asyncio.create_task() were added in 3.7).