Tags: javascript, python, html, web-scraping

python javascript scrape automatically


Python novice here.

I am trying to scrape company information from the Dutch Transparency Benchmark website for a number of different companies, but I'm at a loss as to how to make it work. I've tried

pd.read_html("https://www.transparantiebenchmark.nl/en/scores-0#/survey/4/company/793")

and

requests.get("https://www.transparantiebenchmark.nl/en/scores-0#/survey/4/company/793")

and then working from there. However, it seems like the data is dynamically generated/queried, and thus not actually contained in the HTML source code these methods retrieve.

If I go to my browser's developer tools and copy the "final" HTML as shown in the "Elements" tab, all the information is in there. But as I'd like to repeat the process for several of the companies, is there any way to automate it?

Alternatively, if there's no direct way to obtain the info from the HTML, there might be a second possibility. The site allows you to download the information as an Excel file for each individual company. Is it possible to somehow automatically "click" the download button and save the file somewhere? Then I might be able to loop over all the companies I need.

Please excuse me if this question is poorly worded, and thank you very much in advance.

Many thanks!

Edit: I have also tried it using BeautifulSoup, as @pmkroeker suggested. But I'm not really sure how to make it run all the JavaScript first, so that the page actually contains the data.


Solution

  • I think you will either want to use a library to render the page, or look for the site's underlying API calls. This answer seems to apply to Python. I will also copy the code from that answer for completeness.


    You can pip install selenium from a command line, and then run something like:

    from selenium import webdriver
    from urllib.request import urlopen  # Python 3; the original answer used Python 2's urllib2
    
    url = 'http://www.google.com'
    file_name = 'C:/Users/Desktop/test.txt'
    
    # Download the raw page source
    conn = urlopen(url)
    data = conn.read()  # bytes in Python 3
    conn.close()
    
    with open(file_name, 'wb') as f:  # write bytes, so open in binary mode
        f.write(data)
    
    # Open the saved file in a real browser so its JavaScript can run
    browser = webdriver.Firefox()
    browser.get('file:///' + file_name)
    html = browser.page_source
    browser.quit()
    

    I think you could probably skip the file write and just pass the URL to that browser.get call directly, but I'll leave that to you to find out.
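    As a sketch of that shortcut (assuming Selenium and a Firefox driver such as geckodriver are installed; the helper name `rendered_html` is my own), loading the URL directly lets the page's JavaScript run before you grab the DOM:

    ```python
    from selenium import webdriver

    def rendered_html(url):
        """Open `url` in a real browser, let its JavaScript run,
        and return the resulting DOM as an HTML string."""
        browser = webdriver.Firefox()  # requires geckodriver on PATH
        try:
            browser.get(url)  # no intermediate file needed
            return browser.page_source
        finally:
            browser.quit()
    ```

    The returned string could then be handed to `pd.read_html` or BeautifulSoup, which would see the same HTML as the browser's "Elements" tab.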

    The other thing you can do is look for the AJAX calls in your browser's developer tools. In Chrome, for example, that's the three dots -> More tools -> Developer tools, or pressing F12. Then look at the Network tab: there will be various requests. Click one, open the Preview tab, and go through each until you find a response that looks like JSON data. You are effectively looking for the API calls the site uses to fetch the data it renders. Once you find one, click the Headers tab and you will see a Request URL.

    For example, https://sa-tb.nl/api/widget/chart/survey/4/sector/38 returns lots of data.

    The problem here is that this may or may not be repeatable (the API may change, IDs may change). But you would have a similar problem with plain HTML scraping, as the HTML could change just as easily.
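    To sketch that second approach: once you have spotted an endpoint like the one above in the Network tab, you can call it directly with `requests` and loop over IDs. This is an assumption-laden sketch, not a documented API; the URL pattern and the `survey`/`sector` parameters are inferred from the single example URL and would need to be verified in your own developer tools:

    ```python
    import requests

    BASE = 'https://sa-tb.nl/api/widget/chart'

    def api_url(survey_id, sector_id):
        """Build the endpoint URL spotted in the browser's Network tab
        (pattern inferred from one example; verify before relying on it)."""
        return f'{BASE}/survey/{survey_id}/sector/{sector_id}'

    def fetch_sector(survey_id, sector_id):
        """Fetch one sector's data as parsed JSON.
        May break at any time if the site changes its API."""
        resp = requests.get(api_url(survey_id, sector_id), timeout=30)
        resp.raise_for_status()  # fail loudly on 4xx/5xx
        return resp.json()

    # e.g. data = fetch_sector(4, 38)
    ```

    Looping `fetch_sector` over the IDs you need would replace the per-company Excel downloads entirely, as long as the endpoint keeps working.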