I want to use requests.Session() to send my login information to a website. Once logged in, I want to navigate to a second URL that can only be accessed after logging in, so that I can scrape data from it.
I am new to scraping and don't really have any experience with HTML.
I am working in Google Colaboratory, if that makes any difference.
This is my code and the outputs:
import requests
page = requests.get("https://app.gristanalytics.com/Account/Login")
page
<Response [200]>
page.status_code
200
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())
This is the output:
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/>
<link href="/lib/bootstrap/css/bootstrap.min.css" rel="stylesheet"/>
<link href="/lib/fontawesome/css/all.min.css" rel="stylesheet"/>
<link href="/lib/datetimepicker/bootstrap-datetimepicker.min.css" rel="stylesheet"/>
<link href="/lib/vue-multiselect/vue-multiselect.min.css" rel="stylesheet"/>
<link href="/css/site.css" rel="stylesheet"/>
<title>
Log in - Grist
</title>
</head>
<body>
<div>
<div class="text-center loginbox">
<form method="post" style="width:100%;max-width:350px;padding:15px;margin:0 auto;">
<img alt="" class="mb-4" src="/images/grist_logo_m_black.png"/>
<h1 class="h3 mb-3 font-weight-normal">
Please sign in
</h1>
<div class="text-danger validation-summary-valid" data-valmsg-summary="true">
<ul>
<li style="display:none">
</li>
</ul>
</div>
<label class="sr-only" for="inputEmail">
Email address
</label>
<input autofocus="" class="form-control my-1" data-val="true" data-val-email="The Email field is not a valid e-mail address." data-val-required="The Email field is required." id="Input_Email" name="Input.Email" placeholder="Email address" required="" type="email" value=""/>
<label class="sr-only" for="inputPassword">
Password
</label>
<input class="form-control my-1" data-val="true" data-val-required="The Password field is required." id="Input_Password" name="Input.Password" placeholder="Password" required="" type="password"/>
<div class="checkbox my-3">
<label>
<input data-val="true" data-val-required="The Remember me? field is required." id="Input_RememberMe" name="Input.RememberMe" type="checkbox" value="true"/>
Remember me
</label>
<p>
<a href="/Account/ForgotPassword">
Forgot your password?
</a>
</p>
</div>
<button class="btn btn-lg btn-primary btn-block" type="submit">
Sign in
</button>
<p class="mt-5 mb-3 text-muted">
© 2018-2022
</p>
<input name="__RequestVerificationToken" type="hidden" value="CfDJ8CxpSY-tCd5Ou0L0wqhntPACCikaoFBOUQLV0RgCaVUJgt9wRSd3p9aVswNuSLU6OPRKsbIm-qvOyZyZErcEm-E__Q2tPauexh3z_T02Oh5TZCpeY12PsUsERY3INO5LUBBmWXeUR6nG5BFHnnNdW70">
<input name="Input.RememberMe" type="hidden" value="false"/>
</input>
</form>
</div>
</div>
<script src="/lib/jquery-validation/dist/Jquery.validate.min.js">
</script>
<script src="/lib/jquery-validation-unobtrusive/jquery.validate.unobtrusive.min.js">
</script>
</body>
</html>
At this point I believe the field names I need in the payload are name="Input.Email" and name="Input.Password".
However, I note that the form has no action attribute in the HTML, so I plan to send the payload to the original URL, as you will see below.
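One way to double-check both points (which fields the form submits, and where a form without an action attribute posts to) is to list every named input, hidden ones included. The sketch below uses only the standard library and a trimmed copy of the form above. `FormFieldCollector`, `describe_form`, and the `TOKEN-PLACEHOLDER` value are illustrative names, not part of the site or of requests. It assumes the usual browser behaviour that a form with no action submits to the page's own URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Trimmed copy of the login form; the token value is a placeholder.
SAMPLE_HTML = """
<form method="post">
  <input name="Input.Email" type="email" value=""/>
  <input name="Input.Password" type="password"/>
  <input name="Input.RememberMe" type="checkbox" value="true"/>
  <input name="__RequestVerificationToken" type="hidden" value="TOKEN-PLACEHOLDER">
  <input name="Input.RememberMe" type="hidden" value="false"/>
</form>
"""


class FormFieldCollector(HTMLParser):
    """Record the first form's action and every named <input> on the page."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = []  # (name, type, value) tuples in document order
        self._seen_form = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._seen_form:
            self._seen_form = True
            self.action = attrs.get("action")
        elif tag == "input" and attrs.get("name"):
            self.fields.append(
                (attrs["name"], attrs.get("type", "text"), attrs.get("value") or "")
            )


def describe_form(html, page_url):
    """Return (submit URL, list of fields) for the form in `html`."""
    parser = FormFieldCollector()
    parser.feed(html)
    # A missing or empty action resolves to the page's own URL.
    target = urljoin(page_url, parser.action or "")
    return target, parser.fields


target, fields = describe_form(
    SAMPLE_HTML, "https://app.gristanalytics.com/Account/Login"
)
print(target)
for name, type_, value in fields:
    print(f"{type_:10} {name} = {value!r}")
```

Any hidden field that shows up in this listing is sent by a browser along with the email and password, so it is worth comparing the listing against the payload.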
payload = {
    'Input.Email': 'MyEmail',  # yes, in practice this is my actual information instead of this placeholder
    'Input.Password': 'MyPassword',  # same here, real password used instead
}
with requests.Session() as session:
    post = session.post('https://app.gristanalytics.com/Account/Login', data=payload)
    r = session.get('https://app.gristanalytics.com/Data/Brewhouse')
    soup = BeautifulSoup(r.content, 'html.parser')
    print(soup.prettify())
The output of this is:
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/>
<link href="/lib/bootstrap/css/bootstrap.min.css" rel="stylesheet"/>
<link href="/lib/fontawesome/css/all.min.css" rel="stylesheet"/>
<link href="/lib/datetimepicker/bootstrap-datetimepicker.min.css" rel="stylesheet"/>
<link href="/lib/vue-multiselect/vue-multiselect.min.css" rel="stylesheet"/>
<link href="/css/site.css" rel="stylesheet"/>
<title>
Log in - Grist
</title>
</head>
<body>
<div>
<div class="text-center loginbox">
<form method="post" style="width:100%;max-width:350px;padding:15px;margin:0 auto;">
<img alt="" class="mb-4" src="/images/grist_logo_m_black.png"/>
<h1 class="h3 mb-3 font-weight-normal">
Please sign in
</h1>
<div class="text-danger validation-summary-valid" data-valmsg-summary="true">
<ul>
<li style="display:none">
</li>
</ul>
</div>
<label class="sr-only" for="inputEmail">
Email address
</label>
<input autofocus="" class="form-control my-1" data-val="true" data-val-email="The Email field is not a valid e-mail address." data-val-required="The Email field is required." id="Input_Email" name="Input.Email" placeholder="Email address" required="" type="email" value=""/>
<label class="sr-only" for="inputPassword">
Password
</label>
<input class="form-control my-1" data-val="true" data-val-required="The Password field is required." id="Input_Password" name="Input.Password" placeholder="Password" required="" type="password"/>
<div class="checkbox my-3">
<label>
<input data-val="true" data-val-required="The Remember me? field is required." id="Input_RememberMe" name="Input.RememberMe" type="checkbox" value="true"/>
Remember me
</label>
<p>
<a href="/Account/ForgotPassword">
Forgot your password?
</a>
</p>
</div>
<button class="btn btn-lg btn-primary btn-block" type="submit">
Sign in
</button>
<p class="mt-5 mb-3 text-muted">
© 2018-2022
</p>
<input name="__RequestVerificationToken" type="hidden" value="CfDJ8CxpSY-tCd5Ou0L0wqhntPAwaiYOz80Q50p5gOcDk9qSF-gR4JJpzNGOdSKiQOzcVPp8hBKgDaEwXOrbFnpgdYXkedfcnLQlXIJ1Z7HnIi5vKZybNd6VSKk_Xs5Az444e3Oug-u1UFcxq_OLX1Iu0wU">
<input name="Input.RememberMe" type="hidden" value="false"/>
</input>
</form>
</div>
</div>
<script src="/lib/jquery-validation/dist/Jquery.validate.min.js">
</script>
<script src="/lib/jquery-validation-unobtrusive/jquery.validate.unobtrusive.min.js">
</script>
</body>
</html>
This is the same HTML as the first time. It is clear that I am not logged in, and as a result I can't get to the HTML of the URL I want.
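Rather than comparing the two HTML dumps by eye, a small heuristic can flag when a response is still the sign-in page. This is only a sketch: `looks_like_login_page` is a hypothetical helper, the check relies on the "Log in - Grist" title and the /Account/Login path seen above, and the ReturnUrl query string and "Brewhouse - Grist" title in the examples are made-up values for illustration.

```python
import re
from urllib.parse import urlsplit


def looks_like_login_page(final_url, html):
    """Heuristic: True when a response appears to be the sign-in page."""
    path = urlsplit(final_url).path.lower()
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    title = match.group(1).strip().lower() if match else ""
    return path.endswith("/account/login") or title.startswith("log in")


# Illustrative inputs only; with requests these would be r.url and r.text.
print(looks_like_login_page(
    "https://app.gristanalytics.com/Account/Login?ReturnUrl=%2FData%2FBrewhouse",
    "<title>\nLog in - Grist\n</title>",
))
print(looks_like_login_page(
    "https://app.gristanalytics.com/Data/Brewhouse",
    "<title>Brewhouse - Grist</title>",
))
```

With the session code above, the call would be `looks_like_login_page(r.url, r.text)` right after the `session.get(...)`.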
I have also tried other variations of the payload field names. Sample code for variation 1:
payload = {
    'inputEmail': 'MyEmail',  # yes, in practice this is my actual information instead of this placeholder
    'inputPassword': 'MyPassword',  # same here, real password used instead
}
I get no error or warning messages when running this code, so I'm a bit stuck as to what to do.
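The silence is expected: requests does not raise an exception for an HTTP error status unless `raise_for_status()` is called, so a rejected login passes quietly. A sketch of how the POST response could be inspected follows. `summarize_response` is a hypothetical helper, and the stand-in object with status 400 is an assumption made so the example runs offline, not a value observed from the site.

```python
from types import SimpleNamespace


def summarize_response(resp):
    """Return status, final URL, and redirect chain of a requests-style response."""
    return {
        "status": resp.status_code,
        "final_url": resp.url,
        "redirects": [(hop.status_code, hop.url) for hop in resp.history],
    }


# Offline stand-in for the object returned by session.post(...).
fake_post = SimpleNamespace(
    status_code=400,  # illustrative only
    url="https://app.gristanalytics.com/Account/Login",
    history=[],
)
print(summarize_response(fake_post))
```

In the real code this would be `summarize_response(post)` directly after the `session.post(...)` call. A 4xx status, or a final URL that is still the login page, suggests the server did not accept the form.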
The following code helped me log in and get where I wanted to go!
A big thank you to @bushcat69 for the help they provided; I probably wouldn't have looked seriously at the verification token without them.
I also used the following Stack Exchange posts [1, 2] for additional information.
with requests.Session() as session:
    # GET the login page first to read a fresh __RequestVerificationToken
    read = session.get('https://app.gristanalytics.com/Account/Login')
    soup = BeautifulSoup(read.content, 'html.parser')
    token = soup.select_one('[name="__RequestVerificationToken"]').get('value')
    # Send the hidden form fields along with the credentials
    payload = {
        'Input.Email': 'MyEmail@email.com',
        'Input.Password': 'MyPassword',
        '__RequestVerificationToken': token,
        'Input.RememberMe': 'false'
    }
    post = session.post('https://app.gristanalytics.com/Account/Login', data=payload)
    # The same session keeps the login cookies for the protected page
    r = session.get('https://app.gristanalytics.com/Data/Brewhouse')
    tastySoup = BeautifulSoup(r.content, 'html.parser')
    print(tastySoup.prettify())
I am now having issues: some of the content I want to scrape appears to be loaded through Ajax/JavaScript, which I don't know how to retrieve. If you're having similar issues, look at my future questions; I will also leave a comment here linking to any Stack Exchange post or other website that helps me figure it out.
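For anyone in the same spot, a common approach is to find the request the page's JavaScript makes (for example in the browser's developer tools, under the Network tab) and call that URL through the logged-in session. The sketch below is an assumption-heavy illustration: `fetch_json` is a hypothetical helper, the endpoint URL and the `page` parameter are invented, and `FakeSession`/`FakeResponse` are offline stand-ins so the example runs without the site. Whether this site returns JSON at all is not something I have confirmed.

```python
def fetch_json(session, url, params=None):
    """GET a JSON endpoint through an already logged-in session."""
    response = session.get(
        url,
        params=params,
        headers={"Accept": "application/json", "X-Requested-With": "XMLHttpRequest"},
    )
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of silently
    return response.json()


class FakeResponse:
    """Offline stand-in for requests.Response, used only to exercise the sketch."""

    def __init__(self, payload):
        self._payload = payload

    def raise_for_status(self):
        pass

    def json(self):
        return self._payload


class FakeSession:
    """Offline stand-in for a logged-in requests.Session."""

    def __init__(self):
        self.calls = []

    def get(self, url, params=None, headers=None):
        self.calls.append((url, params, headers))
        return FakeResponse({"rows": []})


demo_session = FakeSession()
data = fetch_json(
    demo_session,
    "https://app.gristanalytics.com/api/hypothetical-endpoint",  # invented URL
    params={"page": 1},
)
print(data)
```

In real use, `session` would be the `requests.Session()` from the login code above, called inside the same `with` block so the login cookies are sent.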