python web-scraping python-requests cdn

Scraping JavaScript-rendered data from a CDN with Python


There is a large dataset of neatly stored tabular data, found here, that I would like to parse and save locally.

The problem is that no matter how deep I drill down when inspecting the source code, there isn't any actual data, nor any discernible source page.

My question: is it even possible to access the data with the typical requests.get() and .content approach? Or would something like Selenium do the trick? If neither of these, then what?

Thanks in advance.


Solution

  • See my comment; for what it's worth, here's the request that should work but doesn't. I'm not sure why, unless there is some security on their end involving cookies.

    Inspecting the page shows that it makes a POST request to c0cre127.caspio.com/dp/311a1000697d9171cc1c4128ae42, and the response comes back in a structured format; you can see it in the Preview tab for that request in the browser's network tools. Interestingly, the 'responseText' field, which holds the data, is all HTML, so in theory you could just parse that part to grab what you need. The problem is that when I recreate this HTTP request, the server says the AppKey, which the request sends as part of the cookie, is wrong.

    So Selenium would work; I'm not sure I can do much about the AppKey.
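
The parsing idea above can be sketched with the standard library alone. This is only an illustration: it assumes the response body is JSON with a `responseText` key holding an HTML table, and the sample payload, its contents, and the names `TableRowParser` and `rows_from_payload` are made up here rather than taken from the real Caspio response.

```python
import json
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def rows_from_payload(payload_text):
    """Parse a JSON payload and return the table rows found in its HTML."""
    payload = json.loads(payload_text)
    parser = TableRowParser()
    parser.feed(payload["responseText"])
    parser.close()
    return parser.rows


# Hypothetical payload shaped like the description above (not real data).
sample = json.dumps({
    "responseText": (
        "<table><tr><th>Date</th><th>Property</th><th>Zone</th></tr>"
        "<tr><td>04/05/2020</td><td>Atterbury</td><td>Central</td></tr></table>"
    )
})

rows = rows_from_payload(sample)
```

With a real response, `rows_from_payload(response.text)` would take the place of the sample, provided the payload really has that shape.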

    import requests
    
    cookies = {
        'cbParamList': '',
        'cbCookieAccepted': '1',
        'AppKey': '311a1000697d9171cc1c4128ae42',
        'AWSALB': '76fnReAlqLZyJz4gNmSMnGc3oluXMlbsrGwaF+kcm4Rg8fklrjjrxvmez+XxXXg/yDle490fw/MKBNPWCyoGAiihFYgcWQ1RSp0vxSGJHDnfXncHSQuprTjv8Fjk',
        'AWSALBCORS': '76fnReAlqLZyJz4gNmSMnGc3oluXMlbsrGwaF+kcm4Rg8fklrjjrxvmez+XxXXg/yDle490fw/MKBNPWCyoGAiihFYgcWQ1RSp0vxSGJHDnfXncHSQuprTjv8Fjk',
    }
    
    headers = {
        'authority': 'c0cre127.caspio.com',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36',
        'content-type': 'multipart/form-data; boundary=----WebKitFormBoundarykaIBnhjgBEZ0L714',
        'accept': '*/*',
        'origin': 'https://c0cre127.caspio.com',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'cors',
        'sec-fetch-dest': 'empty',
        'referer': 'https://c0cre127.caspio.com/dp/311a1000697d9171cc1c4128ae42',
        'accept-language': 'en-US,en;q=0.9',
    }
    
    params = (
        ('rnd', '1596940878792'),
    )
    
    data = '$------WebKitFormBoundarykaIBnhjgBEZ0L714\\r\\nContent-Disposition: form-data; name="cbUniqueFormId"\\r\\n\\r\\n_69831fa53c178f\\r\\n------WebKitFormBoundarykaIBnhjgBEZ0L714\\r\\nContent-Disposition: form-data; name="ComparisonType1_1"\\r\\n\\r\\n\\r\\n------WebKitFormBoundarykaIBnhjgBEZ0L714\\r\\nContent-Disposition: form-data; name="MatchNull1_1"\\r\\n\\r\\nN\\r\\n------WebKitFormBoundarykaIBnhjgBEZ0L714\\r\\nContent-Disposition: form-data; name="FieldName2"\\r\\n\\r\\nDate\\r\\n------WebKitFormBoundarykaIBnhjgBEZ0L714\\r\\nContent-Disposition: form-data; name="Operator2"\\r\\n\\r\\nOR\\r\\n------WebKitFormBoundarykaIBnhjgBEZ0L714\\r\\nContent-Disposition: form-data; name="NumCriteriaDetails2"\\r\\n\\r\\n1\\r\\n------WebKitFormBoundarykaIBnhjgBEZ0L714\\r\\nContent-Disposition: form-data; name="ComparisonType2_1"\\r\\n\\r\\n=\\r\\n------WebKitFormBoundarykaIBnhjgBEZ0L714\\r\\nContent-Disposition: form-data; name="MatchNull2_1"\\r\\n\\r\\nN\\r\\n------WebKitFormBoundarykaIBnhjgBEZ0L714\\r\\nContent-Disposition: form-data; name="FieldName3"\\r\\n\\r\\n\\r\\n------WebKitFormBoundarykaIBnhjgBEZ0L714\\r\\nContent-Disposition: form-data; name="Operator3"\\r\\n\\r\\n\\r\\n------WebKitFormBoundarykaIBnhjgBEZ0L714\\r\\nContent-Disposition: form-data; name="NumCriteriaDetails3"\\r\\n\\r\\n1\\r\\n------WebKitFormBoundarykaIBnhjgBEZ0L714\\r\\nContent-Disposition: form-data; name="ComparisonType3_1"\\r\\n\\r\\n\\r\\n------WebKitFormBoundarykaIBnhjgBEZ0L714\\r\\nContent-Disposition: form-data; name="MatchNull3_1"\\r\\n\\r\\nN\\r\\n------WebKitFormBoundarykaIBnhjgBEZ0L714\\r\\nContent-Disposition: form-data; name="FieldName4"\\r\\n\\r\\nProperty\\r\\n------WebKitFormBoundarykaIBnhjgBEZ0L714\\r\\nContent-Disposition: form-data; name="Operator4"\\r\\n\\r\\nOR\\r\\n------WebKitFormBoundarykaIBnhjgBEZ0L714\\r\\nContent-Disposition: form-data; name="NumCriteriaDetails4"\\r\\n\\r\\n1\\r\\n------WebKitFormBoundarykaIBnhjgBEZ0L714\\r\\nContent-Disposition: form-data; name="ComparisonType4_1"\\r\\n\\r\\n=\\r\\n------WebKitFormBoundarykaIBnhjgBEZ0L714\\r\\nContent-Disposition: form-data; name="MatchNull4_1"\\r\\n\\r\\nN\\r\\n------WebKitFormBoundarykaIBnhjgBEZ0L714\\r\\nContent-Disposition: form-data; name="FieldName5"\\r\\n\\r\\n\\r\\n------WebKitFormBoundarykaIBnhjgBEZ0L714\\r\\nContent-Disposition: form-data; name="Operator5"\\r\\n\\r\\n\\r\\n------WebKitFormBoundarykaIBnhjgBEZ0L714\\r\\nContent-Disposition: form-data; name="NumCriteriaDetails5"\\r\\n\\r\\n1\\r\\n------WebKitFormBoundarykaIBnhjgBEZ0L714\\r\\nContent-Disposition: form-data; name="ComparisonType5_1"\\r\\n\\r\\n\\r\\n------WebKitFormBoundarykaIBnhjgBEZ0L714\\r\\nContent-Disposition: form-data; name="MatchNull5_1"\\r\\n\\r\\nN\\r\\n------WebKitFormBoundarykaIBnhjgBEZ0L714\\r\\nContent-Disposition: form-data; name="FieldName6"\\r\\n\\r\\nZone\\r\\n------WebKitFormBoundarykaIBnhjgBEZ0L714\\r\\nContent-Disposition: form-data; name="Operator6"\\r\\n\\r\\nOR\\r\\n------WebKitFormBoundarykaIBnhjgBEZ0L714\\r\\nContent-Disposition: form-data; name="NumCriteriaDetails6"\\r\\n\\r\\n1\\r\\n------WebKitFormBoundarykaIBnhjgBEZ0L714\\r\\nContent-Disposition: form-data; name="ComparisonType6_1"\\r\\n\\r\\n=\\r\\n------WebKitFormBoundarykaIBnhjgBEZ0L714\\r\\nContent-Disposition: form-data; name="MatchNull6_1"\\r\\n\\r\\nN\\r\\n------WebKitFormBoundarykaIBnhjgBEZ0L714\\r\\nContent-Disposition: form-data; name="AppKey"\\r\\n\\r\\n311a1000697d9171cc1c4128ae42\\r\\n------WebKitFormBoundarykaIBnhjgBEZ0L714\\r\\nContent-Disposition: form-data; name="PrevPageID"\\r\\n\\r\\n1\\r\\n------WebKitFormBoundarykaIBnhjgBEZ0L714\\r\\nContent-Disposition: form-data; name="cbPageType"\\r\\n\\r\\nSearch\\r\\n------WebKitFormBoundarykaIBnhjgBEZ0L714\\r\\nContent-Disposition: form-data; name="PageID"\\r\\n\\r\\n2\\r\\n------WebKitFormBoundarykaIBnhjgBEZ0L714\\r\\nContent-Disposition: form-data; name="GlobalOperator"\\r\\n\\r\\nAND\\r\\n------WebKitFormBoundarykaIBnhjgBEZ0L714\\r\\nContent-Disposition: form-data; name="NumCriteria"\\r\\n\\r\\n6\\r\\n------WebKitFormBoundarykaIBnhjgBEZ0L714\\r\\nContent-Disposition: form-data; name="Search"\\r\\n\\r\\n1\\r\\n------WebKitFormBoundarykaIBnhjgBEZ0L714\\r\\nContent-Disposition: form-data; name="Value2_1"\\r\\n\\r\\n04/05/2020\\r\\n------WebKitFormBoundarykaIBnhjgBEZ0L714\\r\\nContent-Disposition: form-data; name="Value4_1"\\r\\n\\r\\nAtterbury\\r\\n------WebKitFormBoundarykaIBnhjgBEZ0L714\\r\\nContent-Disposition: form-data; name="Value6_1"\\r\\n\\r\\nCentral\\r\\n------WebKitFormBoundarykaIBnhjgBEZ0L714\\r\\nContent-Disposition: form-data; name="ClientQueryString"\\r\\n\\r\\n\\r\\n------WebKitFormBoundarykaIBnhjgBEZ0L714\\r\\nContent-Disposition: form-data; name="AjaxAction"\\r\\n\\r\\nSearchForm\\r\\n------WebKitFormBoundarykaIBnhjgBEZ0L714\\r\\nContent-Disposition: form-data; name="GridMode"\\r\\n\\r\\nFalse\\r\\n------WebKitFormBoundarykaIBnhjgBEZ0L714\\r\\nContent-Disposition: form-data; name="cbUniqueFormId"\\r\\n\\r\\n_69831fa53c178f\\r\\n------WebKitFormBoundarykaIBnhjgBEZ0L714\\r\\nContent-Disposition: form-data; name="AjaxActionHostName"\\r\\n\\r\\nhttps://c0cre127.caspio.com\\r\\n------WebKitFormBoundarykaIBnhjgBEZ0L714\\r\\nContent-Disposition: form-data; name="cbAjaxReferrer"\\r\\n\\r\\nhttps://c0cre127.caspio.com/dp/311a1000697d9171cc1c4128ae42\\r\\n------WebKitFormBoundarykaIBnhjgBEZ0L714--\\r\\n'
    
    response = requests.post('https://c0cre127.caspio.com/dp/311a1000697d9171cc1c4128ae42', headers=headers, params=params, cookies=cookies, data=data)
    

    Output

    'Undefined AppKey. (<a href="http://www.caspio.com/l/default.ashx?s=157">Caspio Bridge</a> error) (60011)'
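
One thing that may be worth ruling out before blaming cookies: the body in the request above is a hand-written string that starts with a stray `$`, and its `\\r\\n` sequences are escaped backslashes, so Python sends a literal backslash, `r`, backslash, `n` rather than CRLF line breaks. The server may therefore never see a well-formed `AppKey` part. Below is a minimal sketch of encoding such fields with real CRLFs. `encode_multipart` is a hypothetical helper, only a handful of the fields are shown, and whether a correctly encoded body makes the endpoint accept the AppKey is untested.

```python
import uuid


def encode_multipart(fields, boundary=None):
    """Encode (name, value) pairs as a multipart/form-data body.

    Returns a (body_bytes, content_type_header_value) tuple.
    """
    boundary = boundary or uuid.uuid4().hex
    parts = []
    for name, value in fields:
        parts.append(
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n'
            "\r\n"
            f"{value}\r\n"
        )
    # The closing delimiter carries two extra dashes after the boundary.
    parts.append(f"--{boundary}--\r\n")
    body = "".join(parts).encode("utf-8")
    return body, f"multipart/form-data; boundary={boundary}"


# A few of the fields from the request above; a list of pairs (not a dict)
# keeps repeated names such as cbUniqueFormId possible.
fields = [
    ("AppKey", "311a1000697d9171cc1c4128ae42"),
    ("cbPageType", "Search"),
    ("Value2_1", "04/05/2020"),
    ("Value4_1", "Atterbury"),
    ("Value6_1", "Central"),
]

body, content_type = encode_multipart(fields)
```

The result could then be sent with `requests.post(url, data=body, headers={'content-type': content_type})`. requests can also build such a body itself when each field is passed through `files=` as a `(None, value)` tuple; in that case the `content-type` header should be left unset so that requests can write the matching boundary.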
    
