Tags: python, authentication, beautifulsoup, lxml, access-token

Authentication results in 404 code


There is a website I need to scrape, but first I need to log in.

There seem to be three things I need in order to get in: the username, the password and an authenticity token. I know the username and password, but I am not sure how to access the token.

This is what I have tried:

import requests
from lxml import html

login_url = "https://urs.earthdata.nasa.gov/home"

session_requests = requests.session()
result = session_requests.get(login_url)
tree = html.fromstring(result.text)
authenticity_token = list(set(tree.xpath("//input[@name='authenticity_token']/@value")))[0]

payload = {"username": "my_name",
           "password": "my_password",
           "authenticity_token": authenticity_token}

result = session_requests.post(
    login_url, 
    data = payload, 
    headers = dict(referer=login_url)
)

print(result)

This results in:

<Response [404]>

My name and password are entered correctly, so it must be the token that is going wrong. I think the problem is this line:

authenticity_token = list(set(tree.xpath("//input[@name='authenticity_token']/@value")))[0]

or this line:

payload = {"username": "my_name",
           "password": "my_password",
           "authenticity_token": authenticity_token}

Looking at the page source, I noticed there is an authenticity_token, a csrf-token and a csrf-param. So it's possible these are in the wrong order, but I tried all the combinations.

EDIT:

Here is a BeautifulSoup approach that also results in a 404.

# requests and login_url are the same as in the first snippet
from bs4 import BeautifulSoup

s = requests.session()
response = s.get(login_url)

soup = BeautifulSoup(response.text, "lxml")
for n in soup('input'):
    if n['name'] == 'authenticity_token':
        token = n['value']
    if n['name'] == 'utf8':
        utf8 = n['value']
        break

auth = {
    'username': 'my_username',
    'password': 'my_password',
    'authenticity_token': token,
    'utf8': utf8,
}

s.post(login_url, data=auth)

Solution

  • If you inspect the page, you'll notice that the form's action value is '/login', so you have to submit your data to 'https://urs.earthdata.nasa.gov/login' rather than to '/home'.

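    import requests
    from bs4 import BeautifulSoup

    login_url = "https://urs.earthdata.nasa.gov/login"
    home_url = "https://urs.earthdata.nasa.gov/home"

    s = requests.session()
    soup = BeautifulSoup(s.get(home_url).text, "lxml")
    # Collect every named input (hidden token fields included) with its default value.
    data = {i['name']: i.get('value', '') for i in soup.find_all('input', attrs={'name': True})}
    data['username'] = 'my_username'
    data['password'] = 'my_password'
    # Post to the form's action URL, not to the page that displays the form.
    result = s.post(login_url, data=data)

    print(result)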
    

    <Response [200]>

    A quick example with Selenium:

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()
    url = 'https://n5eil01u.ecs.nsidc.org/MOST/MOD10A1.006/'

    driver.get(url)
    # Selenium 4 removed the find_element_by_* helpers; use find_element(By..., ...) instead.
    driver.find_element(By.NAME, 'username').send_keys('my_username')
    driver.find_element(By.NAME, 'password').send_keys('my_password')
    driver.find_element(By.ID, 'login').submit()

    html = driver.page_source
    driver.quit()
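
The point about the form's action value can also be checked in code instead of by inspecting the page by hand. Below is a minimal sketch, assuming the login page contains a single form whose action is a relative path. It uses only the standard library (html.parser and urllib.parse) so it runs without extra packages. The names LoginFormParser and find_form_target are hypothetical helpers, and SAMPLE_HTML is a made-up stand-in for the real page, not a copy of it.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LoginFormParser(HTMLParser):
    """Record the first form's action and the name/value of its inputs."""

    def __init__(self):
        super().__init__()
        self.action = None   # raw action attribute of the first <form>
        self.fields = {}     # input name -> default value
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = attrs.get("action", "")
        elif tag == "input" and self._in_form and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True


def find_form_target(page_html, page_url):
    """Return (absolute URL to POST to, dict of the form's default fields)."""
    parser = LoginFormParser()
    parser.feed(page_html)
    # A relative action such as "/login" is resolved against the page URL.
    return urljoin(page_url, parser.action or ""), parser.fields


# Hypothetical markup standing in for the real login page.
SAMPLE_HTML = """
<form action="/login" method="post">
  <input type="hidden" name="utf8" value="&#x2713;">
  <input type="hidden" name="authenticity_token" value="abc123">
  <input type="text" name="username">
  <input type="password" name="password">
  <input type="submit" name="commit" value="Log in">
</form>
"""

post_url, payload = find_form_target(
    SAMPLE_HTML, "https://urs.earthdata.nasa.gov/home"
)
payload["username"] = "my_username"
payload["password"] = "my_password"
print(post_url)  # https://urs.earthdata.nasa.gov/login
```

With the real site you would pass s.get(home_url).text as page_html and then call s.post(post_url, data=payload) on the same session, so the cookies that go with the token are sent along.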