Tags: python, url, web-scraping, web-search

How to get resulting URL from search?


I am trying to write a program that performs a chemical search on https://echa.europa.eu/ and retrieves the result. The "Search for Chemicals" field is in the middle of the main page. I want to get the resulting URL for each chemical by searching its CAS number (e.g. 67-56-1). However, the URL I get back does not include the CAS number I provided:

https://echa.europa.eu/search-for-chemicals?p_p_id=disssimplesearch_WAR_disssearchportlet&p_p_lifecycle=0&_disssimplesearch_WAR_disssearchportlet_searchOccurred=true&_disssimplesearch_WAR_disssearchportlet_sessionCriteriaId=dissSimpleSearchSessionParam101401584308302720

I tried inserting a different CAS number (71-23-8) into the "p_p_id" parameter, but it did not return the expected search result:
https://echa.europa.eu/search-for-chemicals?p_p_id=71-23-8

I also examined the headers of the GET requests in Chrome's developer tools; they did not include the CAS number either.

Is the website storing the input query in server-side variables? Is there a way, or a tool, to obtain a result URL that includes the searched CAS number?

Once I figure this out, I'll use Python to fetch the data and save it as an Excel file.

Thanks.


Solution

  • You need to get a JSESSIONID cookie by requesting the main URL once, then send a POST to https://echa.europa.eu/search-for-chemicals. The POST also needs several required URL parameters.

    Using bash and curl:

    query="71-23-8"
    millis=$(($(date +%s%N)/1000000))
    curl -s -I -c cookie.txt 'https://echa.europa.eu/search-for-chemicals'
    curl -s -L -b cookie.txt 'https://echa.europa.eu/search-for-chemicals' \
        --data-urlencode "p_p_id=disssimplesearch_WAR_disssearchportlet" \
        --data-urlencode "p_p_lifecycle=1" \
        --data-urlencode "p_p_state=normal" \
        --data-urlencode "p_p_col_id=column-1" \
        --data-urlencode "p_p_col_count=2" \
        --data-urlencode "_disssimplesearch_WAR_disssearchportlet_javax.portlet.action=doSearchAction" \
        --data-urlencode "_disssimplesearch_WAR_disssearchportlet_backURL=https://echa.europa.eu/home?p_p_id=disssimplesearchhomepage_WAR_disssearchportlet&p_p_lifecycle=0&p_p_state=normal&p_p_mode=view&p_p_col_id=column-1&p_p_col_count=2" \
        --data-urlencode "_disssimplesearchhomepage_WAR_disssearchportlet_sessionCriteriaId=" \
        --data "_disssimplesearchhomepage_WAR_disssearchportlet_formDate=$millis" \
        --data "_disssimplesearch_WAR_disssearchportlet_searchOccurred=true" \
        --data "_disssimplesearch_WAR_disssearchportlet_sskeywordKey=$query" \
        --data "_disssimplesearchhomepage_WAR_disssearchportlet_disclaimer=on" \
        --data "_disssimplesearchhomepage_WAR_disssearchportlet_disclaimerCheckbox=on"
    

    Using Python requests and scraping with BeautifulSoup:

    import requests
    from bs4 import BeautifulSoup
    import time
    
    url = 'https://echa.europa.eu/search-for-chemicals'
    query = '71-23-8'
    
    s = requests.Session()
    s.get(url)  # initial request to obtain the JSESSIONID session cookie
    
    r = s.post(url, 
        params = {
            "p_p_id": "disssimplesearch_WAR_disssearchportlet",
            "p_p_lifecycle": "1",
            "p_p_state": "normal",
            "p_p_col_id": "column-1",
            "p_p_col_count": "2",
            "_disssimplesearch_WAR_disssearchportlet_javax.portlet.action": "doSearchAction",
            "_disssimplesearch_WAR_disssearchportlet_backURL": "https://echa.europa.eu/home?p_p_id=disssimplesearchhomepage_WAR_disssearchportlet&p_p_lifecycle=0&p_p_state=normal&p_p_mode=view&p_p_col_id=column-1&p_p_col_count=2",
            "_disssimplesearchhomepage_WAR_disssearchportlet_sessionCriteriaId": ""
        },
        data = {
            "_disssimplesearchhomepage_WAR_disssearchportlet_formDate": int(round(time.time() * 1000)),
            "_disssimplesearch_WAR_disssearchportlet_searchOccurred": "true",
            "_disssimplesearch_WAR_disssearchportlet_sskeywordKey": query,
            "_disssimplesearchhomepage_WAR_disssearchportlet_disclaimer": "on",
            "_disssimplesearchhomepage_WAR_disssearchportlet_disclaimerCheckbox": "on"
        }
    )
    soup = BeautifulSoup(r.text, "html.parser")
    table = soup.find("table")
    
    data = [
        (
            t[0].find("a").text.strip(), 
            t[0].find("a")["href"], 
            t[0].find("div", {"class":"substanceRelevance"}).text.strip(),
            t[1].text.strip(),
            t[2].text.strip(),
            t[3].find("a")["href"] if t[3].find("a") else "",
            t[4].find("a")["href"] if t[4].find("a") else "",
        )
        for t in (t.find_all('td') for t in table.find_all("tr"))
        if len(t) > 0 and t[0].find("a") is not None
    ]
    print(data)
    

    Note that I've set the timestamp parameter (the formDate param) in case it's actually checked on the server.
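    As an aside on why editing p_p_id didn't work: the result URL from the question carries a server-side session identifier, not the query itself. Parsing it with the standard library makes that visible (a sketch using the URL quoted in the question):

    ```python
    from urllib.parse import urlsplit, parse_qs

    # The result URL from the question, split for readability
    url = ("https://echa.europa.eu/search-for-chemicals?"
           "p_p_id=disssimplesearch_WAR_disssearchportlet&p_p_lifecycle=0&"
           "_disssimplesearch_WAR_disssearchportlet_searchOccurred=true&"
           "_disssimplesearch_WAR_disssearchportlet_sessionCriteriaId="
           "dissSimpleSearchSessionParam101401584308302720")

    params = parse_qs(urlsplit(url).query)
    for key, values in params.items():
        print(key, "=", values[0])

    # The CAS number appears nowhere: the search criteria live in the
    # server-side session referenced by sessionCriteriaId, so the URL
    # alone cannot be edited or reused for a different chemical.
    ```
    
    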
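    For the last step you mentioned (saving as an Excel file), a minimal sketch with pandas, assuming rows shaped like the tuples the scraper above produces (the sample row and file name here are hypothetical):

    ```python
    import pandas as pd

    # Hypothetical sample row mirroring the scraper's tuple layout:
    # (name, substance URL, relevance, EC number, CAS number, link 1, link 2)
    data = [
        ("1-propanol", "https://echa.europa.eu/substance-information/",
         "", "200-746-9", "71-23-8", "", ""),
    ]

    columns = ["Name", "URL", "Relevance", "EC number",
               "CAS number", "Link 1", "Link 2"]
    df = pd.DataFrame(data, columns=columns)

    # to_excel needs an Excel writer backend such as openpyxl
    # (pip install openpyxl)
    df.to_excel("echa_results.xlsx", index=False)
    ```

    In the real script you would pass the scraped `data` list in place of the sample row.
    
    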