
Scrape a website with a hidden CSRF token at login with R


As part of a late-night project I am working on out of interest, I am trying to scrape my Uber trip data from their website.

I have had a look at the code of the login page on

https://login.uber.com/login

and have seen that they use the POST method in their form setup as follows:

<form method="post" class="form" novalidate="">
<input type="hidden" name="_csrf_token" value="1452201446-01-hujzoBTxkYPrJessd6zQwnD2ZOFxMOVgIYN8iXntr6c=">
<input type="hidden" data-js="access-token" name="access_token">
  <a href="#" class="btn btn--full btn--facebook" data-js="facebook-connect">
      <span class="push--ends flush">Continue with Facebook</span>
  </a>
  <p class="primary-font primary-font--semibold text-uber-white background-line push--top push--bottom">
    <span>or use email</span>
  </p>

<div class="form-group push-tiny--top flush--bottom">
  <input type="email" name="email" class="text-input square--bottom " placeholder="Email Address" value="" id="email">
</div>
<div class="form-group push--bottom">
  <input type="password" name="password" class="text-input square--top " placeholder="Password" id="password">
</div>

From what I have read, the CSRF token needs to be sent along with the login request when scraping:

library(RCurl)
library(XML)

URL_str<-"https://login.uber.com/login"
URL_str2<-"https://riders.uber.com/trips"

email<-"exampleMail@gmail.com"
pass<-"thisisalong"
token<- "1452201446-01-hujzoBTxkYPrJessd6zQwnD2ZOFxMOVgIYN8iXntr6c="

params <- list('email' = email,
               'password' = pass,
               '_csrf_token'=token)

URL_doc = postForm(URL_str2, style="POST",
                   .params=params)

If I now try and scrape the site, I get

ERROR: FORBIDDEN

I have seen some examples in Python with similar websites. Can the same be done in R?


Solution

  • The final answer ended up being quite simple:

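    library(rvest)
    
    URL_str  <- "https://login.uber.com/login"
    URL_str2 <- "https://riders.uber.com/trips"
    
    # Start a session so that cookies persist across requests
    # (in rvest >= 1.0 the functions used below are named session(),
    # html_form_set(), session_submit() and session_jump_to())
    session <- html_session(URL_str)
    
    email <- "exampleMail@gmail.com"
    pass  <- "thisisalong"
    
    # Handling the html_form: the parsed form already contains the hidden _csrf_token field
    form <- html_form(session)[[1]]
    form <- set_values(form, email = email, password = pass)
    
    # Found that unless I explicitly assigned the value to empty it created errors
    form$url <- ""
    
    session_open <- submit_form(session, form)
    
    # Move to the trips page within the logged-in session
    session_open <- jump_to(session_open, URL_str2)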
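
    A likely reason this works without copying the token by hand is that html_form parses hidden inputs together with the visible ones, so the token issued for the current session is submitted with the form. The offline sketch below illustrates this on a hypothetical, trimmed-down copy of the login form; the HTML string and the token value are made up for illustration and are not taken from the live site.

    ```r
    library(rvest)

    # Hypothetical, simplified login form; the token value is invented.
    login_html <- '<html><body>
    <form method="post" class="form">
      <input type="hidden" name="_csrf_token" value="example-token-123">
      <input type="email" name="email" value="">
      <input type="password" name="password">
    </form>
    </body></html>'

    page <- read_html(login_html)
    form <- html_form(page)[[1]]

    # The hidden field is parsed along with the visible ones.
    field_names <- names(form$fields)
    csrf_value  <- form$fields[["_csrf_token"]]$value

    print(field_names)
    print(csrf_value)
    ```

    In the real session the same form object comes from html_form(session), so the token always matches the cookies of that session, which a hard-coded value pasted from the browser would not.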
    

    From here on it is straightforward to pull tables using html_table and to isolate nodes using html_nodes. Hope this helps.
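
    As a minimal sketch of that last step, the example below selects a table node with html_nodes and converts it with html_table. The HTML fragment, the table id and the column names are hypothetical, since the layout of the real trips page is not shown here; with a live session, session_open would presumably take the place of page.

    ```r
    library(rvest)

    # Hypothetical page fragment standing in for the trips page.
    trips_html <- '<html><body>
    <table id="trips">
      <tr><th>Date</th><th>Fare</th></tr>
      <tr><td>2016-01-05</td><td>12.50</td></tr>
      <tr><td>2016-01-07</td><td>8.75</td></tr>
    </table>
    </body></html>'

    page <- read_html(trips_html)

    # Isolate the table node by CSS selector, then convert it to a data frame.
    trips_node <- html_nodes(page, "table#trips")
    trips <- html_table(trips_node)[[1]]

    print(trips)
    ```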