Search code examples
python-3.xseleniumpython-requestshttpresponsehttp-status-codes

Fetching the status code of multiple web URL's using python-Requests / urllib3 / or selenium module


I'm trying to write a python script to fetch the HTTP status code and response for ~ 200 URLs. The final output is the show these details in an html format with ULR name and there status code, response message, errors if any and screenshot of the page. I have tried using requests and urllib module for developing this script but my code breaks up in between if any HTTPException occurs without capturing the status code and response message for that particular URL. As an alternative solution, I have developed another Python script with selenium module where in I'm trying to capture the performance logs for the URL, specifically "Network.responseReceived".

from selenium import webdriver
from datetime import datetime
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

# enable browser logging
d = DesiredCapabilities.CHROME
d['loggingPrefs'] = { 'performance':'ALL' }
options = webdriver.ChromeOptions()  
options.add_argument("--headless")  
driver = webdriver.Chrome(chrome_options=options, executable_path="C:\\chromedriver_win32\\chromedriver.exe")
#driver = webdriver.Ie(executable_path="C:\\IE_driver\\MicrosoftWebDriver.exe")
driver.get("https://www.google.com")
#driver.get('https://www.google.com/nonexistant')

print(driver.title)
performance_log = driver.get_log('performance')

for entry in performance_log:
    print(type(entry))
    print (entry)
    print("================================================")
    print(" ")
    print(" ")

driver.close()

Below is the output that I have got.

Google
<class 'dict'>
{'level': 'INFO', 'message': '{"message":{"method":"Network.loadingFinished","params":{"encodedDataLength":0,"requestId":"D99D380DD024B8928B5EAAC76E447956","shouldReportCorbBlocking":false,"timestamp":528401.402473}},"webview":"8DBAE0AE8594201DC3D129C819A696C8"}', 'timestamp': 1554297228343}
================================================


<class 'dict'>
{'level': 'INFO', 'message': '{"message":{"method":"Page.frameNavigated","params":{"frame":{"id":"8DBAE0AE8594201DC3D129C819A696C8","loaderId":"D99D380DD024B8928B5EAAC76E447956","mimeType":"text/plain","securityOrigin":"://","url":"data:,"}}},"webview":"8DBAE0AE8594201DC3D129C819A696C8"}', 'timestamp': 1554297228343}
================================================


<class 'dict'>
{'level': 'INFO', 'message': '{"message":{"method":"Page.loadEventFired","params":{"timestamp":528401.409908}},"webview":"8DBAE0AE8594201DC3D129C819A696C8"}', 'timestamp': 1554297228344}
================================================


<class 'dict'>
{'level': 'INFO', 'message': '{"message":{"method":"Page.frameStoppedLoading","params":{"frameId":"8DBAE0AE8594201DC3D129C819A696C8"}},"webview":"8DBAE0AE8594201DC3D129C819A696C8"}', 'timestamp': 1554297228346}
================================================


<class 'dict'>
{'level': 'INFO', 'message': '{"message":{"method":"Page.domContentEventFired","params":{"timestamp":528401.41067}},"webview":"8DBAE0AE8594201DC3D129C819A696C8"}', 'timestamp': 1554297228347}
================================================


<class 'dict'>
{'level': 'INFO', 'message': '{"message":{"method":"Network.requestWillBeSent","params":{"documentURL":"https://www.google.com/","frameId":"8DBAE0AE8594201DC3D129C819A696C8","hasUserGesture":false,"initiator":{"type":"other"},"loaderId":"16D0090B144D4D0D6DB68B993CE5DE12","request":{"headers":{"Upgrade-Insecure-Requests":"1","User-Agent":"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/72.0.3626.109 Safari/537.36"},"initialPriority":"VeryHigh","method":"GET","mixedContentType":"none","referrerPolicy":"no-referrer-when-downgrade","url":"https://www.google.com/"},"requestId":"16D0090B144D4D0D6DB68B993CE5DE12","timestamp":528401.455107,"type":"Document","wallTime":1554297228.37452}},"webview":"8DBAE0AE8594201DC3D129C819A696C8"}', 'timestamp': 1554297228378}
================================================


<class 'dict'>
{'level': 'INFO', 'message': '{"message":{"method":"Network.responseReceived","params":{"frameId":"8DBAE0AE8594201DC3D129C819A696C8","loaderId":"16D0090B144D4D0D6DB68B993CE5DE12","requestId":"16D0090B144D4D0D6DB68B993CE5DE12","response":{"connectionId":17,"connectionReused":false,"encodedDataLength":6681,"fromDiskCache":false,"fromServiceWorker":false,"headers":{"alt-svc":"quic=\\":443\\"; ma=2592000; v=\\"46,44,43,39\\"","cache-control":"private, max-age=0","content-encoding":"gzip","content-length":"65219","content-type":"text/html; charset=UTF-8","date":"Wed, 03 Apr 2019 13:13:52 GMT","expires":"-1","p3p":"CP=\\"This is not a P3P policy! See g.co/p3phelp for more info.\\"","server":"gws","set-cookie":"1P_JAR=2019-04-03-13; expires=Fri, 03-May-2019 13:13:52 GMT; path=/; domain=.google.com\\nNID=180=fV81eC5C8adCVzltTPlJnIxiDUi4bSEzqRVHIQwx7z5S75opd6k3fmtLeGNOllEqRlpcQ-X31RSveq0FgdL5e0GBcVZxYZjzI9g2Bgn_Wepj5RfErPoo5re54HFO-sgiXV5vqNftY7JHm60YxVYQXJqp9HhpdbpB0cJ3HLOCguo; expires=Thu, 03-Oct-2019 13:13:52 GMT; path=/; domain=.google.com; HttpOnly","status":"200","x-frame-options":"SAMEORIGIN","x-xss-protection":"0"},"mimeType":"text/html","protocol":"h2","remoteIPAddress":"172.217.168.196","remotePort":443,"requestHeaders":{":authority":"www.google.com",":method":"GET",":path":"/",":scheme":"https","accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8","accept-encoding":"gzip, deflate, br","upgrade-insecure-requests":"1","user-agent":"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/72.0.3626.109 Safari/537.36"},"securityDetails":{"certificateId":0,"certificateTransparencyCompliance":"unknown","cipher":"AES_128_GCM","issuer":"Google Internet Authority G3","keyExchange":"","keyExchangeGroup":"X25519","protocol":"TLS 1.3","sanList":["www.google.com"],"signedCertificateTimestampList":[],"subjectName":"www.google.com","validFrom":1551433595,"validTo":1558689900},"securityState":"secure","status":200,"statusText":"","timing":{"connectEnd":3683.223,"connectStart":2467.054,"dnsEnd":2467.054,"dnsStart":2352.226,"proxyEnd":2351.998,"proxyStart":86.464,"pushEnd":0,"pushStart":0,"receiveHeadersEnd":3976.231,"requestTime":528401.456284,"sendEnd":3687.307,"sendStart":3685.241,"sslEnd":3683.104,"sslStart":2620.349,"workerReady":-1,"workerStart":-1},"url":"https://www.google.com/"},"timestamp":528405.434789,"type":"Document"}},"webview":"8DBAE0AE8594201DC3D129C819A696C8"}', 'timestamp': 1554297232388}
================================================



I need to parse Network.responseReceived details as it has all the required details. So what should I do to parse the details from the Network.responseReceived logs.


Solution

  • Transform the "message" key of every entry to a python dict, and extract the needed property.

    In the beginning of your script, add an import of the json library; then, inside the loop over performance_log:

    for entry in performance_log:
        message = json.loads(entry['message'])
    

    Now the variable message will be a normal python dictionary, and you can get any property you need out of it. For instance, this is the status code:

    print(message['message']['params']['response']['status'])
    

    And this is the target url:

    print(message['message']['params']['response']['url'])
    

    Keep in mind you will get an entry of every request for resource the browser/the html creates - you'll probably want to filter only to the top-most/domain ones.