Tags: javascript, python-3.x, rest, web-scraping, insomnia

Web scraping data by modifying JavaScript parameters


I am trying to scrape intraday prices for a company using this website: Enel Intraday

When the website pulls the data, it splits it across a few hundred pages, which makes pulling the data very time-consuming. Using insomnia.rest (for the first time), I have been trying to play with the URL GET parameters and to find the actual JavaScript function that returns these table values, but without success.

Having inspected the search button, I found that the JS function is called "searchIntraday" and that it uses a form called "intraday_form" as input.

(Screenshot: inspecting the Trova button)

I am basically trying to get the following data in one call rather than having to go through all the tab pages, so a full day would look like this:

Time    Last Trade Price    Var %   Last Volume Type
5:40:49 PM  7.855   -2.88   570 AT
5:38:17 PM  7.855   -2.88   300 AT
5:37:10 PM  7.855   -2.88   290 AT
5:36:06 PM  7.855   -2.88   850 AT
5:35:56 PM  7.855   -2.88   14,508,309  UT
5:29:59 PM  7.872   -2.67   260 AT
5:29:59 PM  7.871   -2.68   4,300   AT
5:29:59 PM  7.872   -2.67   439 AT
5:29:59 PM  7.872   -2.67   3,575   AT
5:29:59 PM  7.87    -2.7    1,000   AT
5:29:59 PM  7.87    -2.7    1,000   AT
5:29:59 PM  7.87    -2.7    1,000   AT
5:29:59 PM  7.87    -2.7    4,000   AT
5:29:59 PM  7.87    -2.7    300 AT
5:29:59 PM  7.87    -2.7    2,000   AT
5:29:59 PM  7.87    -2.7    200 AT
5:29:59 PM  7.87    -2.7    400 AT
5:29:59 PM  7.87    -2.7    500 AT
5:29:59 PM  7.872   -2.67   1,812   AT
5:29:59 PM  7.872   -2.67   5,000   AT

..................................................

Time    Last Trade Price    Var %   Last Volume Type
9:00:07 AM  8.1 0.15    933,945 UT

which for that day means iterating from page 1 to page 1017!

I looked at the pages below for help:

JS Scrape article

Stack Overflow similar issue with an answer

(Screenshot: Insomnia report)


Solution

  • The data doesn't appear to be generated by JavaScript, but rather by loading pages. The image below is the response I get when I load the link beneath it. You can see that the location of the request matches the location on the page, and that the HTML for the table is sent along with the page response.

    The HTML in the response indicates that the pages are generated on the server side rather than the client side. Unfortunately, unless you find a way to browse and see all the results you want in one shot, you're going to have to iterate through each page. If you do manage to find a magic URL, you can just process that one instead.

    https://www.borsaitaliana.it/borsa/azioni/contratti.html?isin=IT0003128367&lang=en&page=10

    (Screenshot: the page response, with the table HTML visible in the body)
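
    If you want to verify this yourself, here is a minimal sketch (using the same requests/pandas pair as the full script below) that pulls a single page and parses the table straight out of the raw HTML. The StringIO wrapper is only there to keep newer pandas versions from warning about literal HTML input.

    from io import StringIO
    
    import pandas as pd
    import requests
    
    # One results page; the page number is just a query-string parameter.
    url = ("https://www.borsaitaliana.it/borsa/azioni/contratti.html"
           "?isin=IT0003128367&lang=en&page=10")
    html = requests.get(url).text
    
    # read_html finds the trade table directly in the server-rendered HTML,
    # so no JavaScript needs to run at all.
    print(pd.read_html(StringIO(html))[0].head())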

    I decided to give it a whirl to see what kind of performance I could get. Below is a complete script that iterates through the first 100 pages.

    import pandas as pd
    import requests
    
    url = "https://www.borsaitaliana.it/borsa/azioni/contratti.html?isin=IT0003128367&lang=en&page="
    
    # Fetch pages 1-100 and stack the first table from each response
    # into a single DataFrame.
    df = pd.concat([
        pd.read_html(requests.get(url + str(page)).content)[0]
        for page in range(1, 101)
    ])
    
    df.to_csv('enel.csv', index=False)
    

    Running it on my machine took about 1.25 minutes for 100 pages:

    $ time python scrape.py 
    
    real    1m16.914s
    user    0m4.039s
    sys 0m0.729s
    

    At roughly 0.77 seconds per page, the full 1,017 pages work out to about 13 minutes, so call it 15 minutes per stock. That's about 7.5 hours for 30 stocks, assuming they're all roughly the same length. You could run it overnight and it would be ready for you in the morning.
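
    Most of that time is network latency rather than CPU, so if 7.5 hours is too long, a handful of concurrent requests should cut it substantially. Below is a sketch using only the standard library's ThreadPoolExecutor; the worker count is an assumption, and you may want to keep it modest to stay polite to the server.

    from concurrent.futures import ThreadPoolExecutor
    from io import StringIO
    
    import pandas as pd
    import requests
    
    URL = "https://www.borsaitaliana.it/borsa/azioni/contratti.html?isin=IT0003128367&lang=en&page="
    
    def fetch_page(page):
        # Each worker downloads one page and parses its first HTML table.
        html = requests.get(URL + str(page)).text
        return pd.read_html(StringIO(html))[0]
    
    # Eight workers: raising this is faster but hits the server harder.
    with ThreadPoolExecutor(max_workers=8) as pool:
        df = pd.concat(pool.map(fetch_page, range(1, 101)))
    
    df.to_csv('enel.csv', index=False)

    In principle eight workers bring the 1,017 pages down from ~13 minutes to a couple of minutes, though the real gain depends on how the server handles parallel requests.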