Tags: python, html, web-scraping, python-requests, python-requests-html

How to use requests.Session() to post a login payload to a URL when the HTML form has no 'action' attribute in Python


I want to use requests.Session() to send my login information to a website. Once logged in, I want to navigate to a second URL that can only be accessed after logging in, so that I can scrape data from it.

I am new to scraping and don't really have any experience with HTML.

I am working in Google Colab (Colaboratory), if that makes any difference.

This is my code and the outputs:

import requests

page = requests.get("https://app.gristanalytics.com/Account/Login")
page

<Response [200]>

page.status_code

200

from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())

This is the output:

<!DOCTYPE html>
<html>
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/>
  <link href="/lib/bootstrap/css/bootstrap.min.css" rel="stylesheet"/>
  <link href="/lib/fontawesome/css/all.min.css" rel="stylesheet"/>
  <link href="/lib/datetimepicker/bootstrap-datetimepicker.min.css" rel="stylesheet"/>
  <link href="/lib/vue-multiselect/vue-multiselect.min.css" rel="stylesheet"/>
  <link href="/css/site.css" rel="stylesheet"/>
  <title>
   Log in - Grist
  </title>
 </head>
 <body>
  <div>
   <div class="text-center loginbox">
    <form method="post" style="width:100%;max-width:350px;padding:15px;margin:0 auto;">
     <img alt="" class="mb-4" src="/images/grist_logo_m_black.png"/>
     <h1 class="h3 mb-3 font-weight-normal">
      Please sign in
     </h1>
     <div class="text-danger validation-summary-valid" data-valmsg-summary="true">
      <ul>
       <li style="display:none">
       </li>
      </ul>
     </div>
     <label class="sr-only" for="inputEmail">
      Email address
     </label>
     <input autofocus="" class="form-control my-1" data-val="true" data-val-email="The Email field is not a valid e-mail address." data-val-required="The Email field is required." id="Input_Email" name="Input.Email" placeholder="Email address" required="" type="email" value=""/>
     <label class="sr-only" for="inputPassword">
      Password
     </label>
     <input class="form-control my-1" data-val="true" data-val-required="The Password field is required." id="Input_Password" name="Input.Password" placeholder="Password" required="" type="password"/>
     <div class="checkbox my-3">
      <label>
       <input data-val="true" data-val-required="The Remember me? field is required." id="Input_RememberMe" name="Input.RememberMe" type="checkbox" value="true"/>
       Remember me
      </label>
      <p>
       <a href="/Account/ForgotPassword">
        Forgot your password?
       </a>
      </p>
     </div>
     <button class="btn btn-lg btn-primary btn-block" type="submit">
      Sign in
     </button>
     <p class="mt-5 mb-3 text-muted">
      © 2018-2022
     </p>
     <input name="__RequestVerificationToken" type="hidden" value="CfDJ8CxpSY-tCd5Ou0L0wqhntPACCikaoFBOUQLV0RgCaVUJgt9wRSd3p9aVswNuSLU6OPRKsbIm-qvOyZyZErcEm-E__Q2tPauexh3z_T02Oh5TZCpeY12PsUsERY3INO5LUBBmWXeUR6nG5BFHnnNdW70">
      <input name="Input.RememberMe" type="hidden" value="false"/>
     </input>
    </form>
   </div>
  </div>
  <script src="/lib/jquery-validation/dist/Jquery.validate.min.js">
  </script>
  <script src="/lib/jquery-validation-unobtrusive/jquery.validate.unobtrusive.min.js">
  </script>
 </body>
</html>

At this point I believe the field names the payload should use are name="Input.Email" and name="Input.Password".

However, I note that there is no action attribute in the HTML code, so I plan to send the payload to the original URL, as you will see below.
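
On the missing action attribute: under the HTML form-submission rules, a form whose action is absent or empty submits to the URL of the page that contains it, so posting back to the login URL is the expected target. A minimal sketch of that resolution, using only the standard library (the helper name resolve_form_action is made up for illustration):

```python
from typing import Optional
from urllib.parse import urljoin

LOGIN_PAGE = "https://app.gristanalytics.com/Account/Login"


def resolve_form_action(page_url: str, action: Optional[str]) -> str:
    """Return the URL a form submits to (hypothetical helper).

    A missing or empty action falls back to the page's own URL;
    a relative action is resolved against the page URL.
    """
    return urljoin(page_url, action or "")


# No action attribute: the form posts back to the login page itself.
print(resolve_form_action(LOGIN_PAGE, None))
# A relative action would be resolved against the page URL instead.
print(resolve_form_action(LOGIN_PAGE, "/Account/ForgotPassword"))
```

With BeautifulSoup, soup.find('form').get('action') returns None for the form shown above, which is the value this helper treats as "post to the same page".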

payload = {
    'Input.Email':  'MyEmail', #yes in practice this is my actual information instead of this placeholder
    'Input.Password': 'MyPassword', #same here real password used instead
}
with requests.Session() as session:
  post = session.post('https://app.gristanalytics.com/Account/Login', data=payload)
  r = session.get('https://app.gristanalytics.com/Data/Brewhouse')
soup = BeautifulSoup(r.content, 'html.parser')
print(soup.prettify())

The output of this is:

<!DOCTYPE html>
<html>
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/>
  <link href="/lib/bootstrap/css/bootstrap.min.css" rel="stylesheet"/>
  <link href="/lib/fontawesome/css/all.min.css" rel="stylesheet"/>
  <link href="/lib/datetimepicker/bootstrap-datetimepicker.min.css" rel="stylesheet"/>
  <link href="/lib/vue-multiselect/vue-multiselect.min.css" rel="stylesheet"/>
  <link href="/css/site.css" rel="stylesheet"/>
  <title>
   Log in - Grist
  </title>
 </head>
 <body>
  <div>
   <div class="text-center loginbox">
    <form method="post" style="width:100%;max-width:350px;padding:15px;margin:0 auto;">
     <img alt="" class="mb-4" src="/images/grist_logo_m_black.png"/>
     <h1 class="h3 mb-3 font-weight-normal">
      Please sign in
     </h1>
     <div class="text-danger validation-summary-valid" data-valmsg-summary="true">
      <ul>
       <li style="display:none">
       </li>
      </ul>
     </div>
     <label class="sr-only" for="inputEmail">
      Email address
     </label>
     <input autofocus="" class="form-control my-1" data-val="true" data-val-email="The Email field is not a valid e-mail address." data-val-required="The Email field is required." id="Input_Email" name="Input.Email" placeholder="Email address" required="" type="email" value=""/>
     <label class="sr-only" for="inputPassword">
      Password
     </label>
     <input class="form-control my-1" data-val="true" data-val-required="The Password field is required." id="Input_Password" name="Input.Password" placeholder="Password" required="" type="password"/>
     <div class="checkbox my-3">
      <label>
       <input data-val="true" data-val-required="The Remember me? field is required." id="Input_RememberMe" name="Input.RememberMe" type="checkbox" value="true"/>
       Remember me
      </label>
      <p>
       <a href="/Account/ForgotPassword">
        Forgot your password?
       </a>
      </p>
     </div>
     <button class="btn btn-lg btn-primary btn-block" type="submit">
      Sign in
     </button>
     <p class="mt-5 mb-3 text-muted">
      © 2018-2022
     </p>
     <input name="__RequestVerificationToken" type="hidden" value="CfDJ8CxpSY-tCd5Ou0L0wqhntPAwaiYOz80Q50p5gOcDk9qSF-gR4JJpzNGOdSKiQOzcVPp8hBKgDaEwXOrbFnpgdYXkedfcnLQlXIJ1Z7HnIi5vKZybNd6VSKk_Xs5Az444e3Oug-u1UFcxq_OLX1Iu0wU">
      <input name="Input.RememberMe" type="hidden" value="false"/>
     </input>
    </form>
   </div>
  </div>
  <script src="/lib/jquery-validation/dist/Jquery.validate.min.js">
  </script>
  <script src="/lib/jquery-validation-unobtrusive/jquery.validate.unobtrusive.min.js">
  </script>
 </body>
</html>

This is the same HTML as the first time. It is clear that I am not logged in, and as a result I can't get to the HTML of the URL I want.

I have tried other variations for the payload field names including:

  1. inputEmail (from for=)
  2. Input_Email (from id=)
  3. email (from type=)

Sample code for variation 1 would be:

payload = {
    'inputEmail':  'MyEmail', #yes in practice this is my actual information instead of this placeholder
    'inputPassword': 'MyPassword', #same here real password used instead
}

I get no error or warning messages when running this code, so I'm a bit stuck as to what to do.
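
The silence is expected: requests does not raise an exception for a rejected login unless raise_for_status() is called, so a failed POST or a redirect back to the login page is just another response. A rough check, assuming the site sends unauthenticated visitors back to /Account/Login (a guess based on the output above, not something confirmed), is to look at the final URL after redirects. The helper name bounced_to_login is hypothetical:

```python
from urllib.parse import urlsplit

LOGIN_PATH = "/Account/Login"  # login path taken from the question


def bounced_to_login(final_url: str, login_path: str = LOGIN_PATH) -> bool:
    """Return True if the final URL (after redirects) is the login page.

    Only the path is compared, so a query string such as a return-URL
    parameter (assumed here for illustration) does not affect the result.
    """
    path = urlsplit(final_url).path.rstrip("/")
    return path.lower() == login_path.lower()


# Intended use with the objects from the code above:
#   print(post.status_code, r.status_code, r.url)
#   print([h.status_code for h in r.history])  # redirects followed, if any
#   if bounced_to_login(r.url):
#       print("Still on the login page, so the login did not succeed")
```

Printing post.status_code alongside this is worthwhile too, since an error status on the POST itself would point at the submitted form data rather than at the second request.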


Solution

  • The following code logged me in and got me where I wanted to go!

    Big thank you to @bushcat69 for the help they provided; I probably wouldn't have looked seriously at the verification token without them.

    I also drew on the following Stack Exchange posts [1, 2] for additional information.

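    import requests
    from bs4 import BeautifulSoup

    with requests.Session() as session:
        # Load the login page inside the session so it keeps any cookies the site sets
        read = session.get('https://app.gristanalytics.com/Account/Login')
        soup = BeautifulSoup(read.content, 'html.parser')
        # The hidden token differs on every page load, so read it fresh each time
        token = soup.select_one('[name="__RequestVerificationToken"]').get('value')
        payload = {
            'Input.Email': 'MyEmail@email.com',
            'Input.Password': 'MyPassword',
            '__RequestVerificationToken': token,
            'Input.RememberMe': 'false',
        }
        # The form has no action attribute, so post back to the login URL itself
        post = session.post('https://app.gristanalytics.com/Account/Login', data=payload)
        # Reuse the same session so the request carries the login cookies
        r = session.get('https://app.gristanalytics.com/Data/Brewhouse')
        tastySoup = BeautifulSoup(r.content, 'html.parser')
        print(tastySoup.prettify())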
    

    I am now running into a new issue: some of the content I want to scrape appears to be loaded through Ajax / JavaScript, which I don't yet know how to retrieve. If you're having similar issues, look at my later questions; I will also leave a comment here linking to any Stack Exchange post or other site that helps me figure it out.
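
Returning to the token step in the solution: rather than naming each hidden field by hand, one option is to copy every hidden input from the login form into the payload and then add the credentials. The sketch below uses the standard library's html.parser so it runs without extra installs; the class and function names are made up for illustration, and the sample HTML is a shortened stand-in for the real page with a placeholder token value.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        if attributes.get("type") == "hidden" and attributes.get("name"):
            self.fields[attributes["name"]] = attributes.get("value") or ""


def hidden_fields(html: str) -> dict:
    """Return all hidden form fields found in the given HTML."""
    collector = HiddenInputCollector()
    collector.feed(html)
    return collector.fields


# Shortened stand-in for the login page; "abc123" is a placeholder token.
sample = (
    '<form method="post">'
    '<input name="Input.Email" type="email" value=""/>'
    '<input name="Input.RememberMe" type="checkbox" value="true"/>'
    '<input name="__RequestVerificationToken" type="hidden" value="abc123">'
    '<input name="Input.RememberMe" type="hidden" value="false"/>'
    '</form>'
)

payload = hidden_fields(sample)
payload.update({"Input.Email": "MyEmail@email.com", "Input.Password": "MyPassword"})
print(payload)
```

In the solution above, sample would be replaced by read.text. With BeautifulSoup already in use, soup.select('input[type="hidden"]') gives the same set of elements.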