Webscraping websites with old unsupported Internet Explorer browser


I am trying to scrape the following website (https://iltacon2022.expofp.com/) and I keep receiving the following error (full output printed below). I'm not sure what the issue is, and I was wondering if someone could help me.

 if (window.navigator.userAgent.indexOf("Trident/") !== -1) {
     alert("Your are using old unsupported Internet Explorer browser.\nPlease upgrade to view this page properly.");
 }

I've tried using selenium and the requests module, but I seem to experience the same problem either way.

Code trials:

from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options
import random
import requests
import time

options = Options()
options.headless = False
driver = webdriver.Firefox(options=options)

url = "https://iltacon2022.expofp.com/"

driver.get(url)

time.sleep(6)

soup = bs(driver.page_source, 'lxml')

driver.quit()

print(soup)

Output:

<html lang="en"><head>
<meta charset="utf-8"/>
<link href="https://iltacon2022.expofp.com/packages/master/favicon.png" rel="shortcut icon"/>
<meta content="user-scalable=no, initial-scale=1.0, maximum-scale=1.0, width=device-width" name="viewport"/>
<!-- <meta name="theme-color" content="#000000" /> -->
<title>ILTACON2022 – Gaylord National Resort and Convention Center | August 22–25, 2022 | Monday – Thursday – Expo Floor Plan by ExpoFP</title>
<script>
            if (window.navigator.userAgent.indexOf("Trident/") !== -1) {
                alert("Your are using old unsupported Internet Explorer browser.\nPlease upgrade to view this page properly.");
            }
        </script>
<style>
            html,
            body {
                touch-action: none;
                margin: 0;
                padding: 0;
                height: 100%;
                width: 100%;
                background: #ebebeb;
                position: fixed;
                overflow: hidden;
            }
            @media (max-width: 820px) and (min-width: 500px) {
                html {
                    font-size: 13px;
                }
            }
        </style>
<style>
            .lds-grid {
                top: 42vh;
                margin: 0 auto;
                display: block;
                position: relative;
                width: 64px;
                height: 64px;
            }

            .lds-grid div {
                position: absolute;
                width: 13px;
                height: 13px;
                background: #aaa;
                border-radius: 50%;
                /* border: solid 1px #fff; */
                animation: lds-grid 1.2s linear infinite;
            }

            .lds-grid div:nth-child(1) {
                top: 6px;
                left: 6px;
                animation-delay: 0s;
            }

            .lds-grid div:nth-child(2) {
                top: 6px;
                left: 26px;
                animation-delay: -0.4s;
            }

            .lds-grid div:nth-child(3) {
                top: 6px;
                left: 45px;
                animation-delay: -0.8s;
            }

            .lds-grid div:nth-child(4) {
                top: 26px;
                left: 6px;
                animation-delay: -0.4s;
            }

            .lds-grid div:nth-child(5) {
                top: 26px;
                left: 26px;
                animation-delay: -0.8s;
            }

            .lds-grid div:nth-child(6) {
                top: 26px;
                left: 45px;
                animation-delay: -1.2s;
            }

            .lds-grid div:nth-child(7) {
                top: 45px;
                left: 6px;
                animation-delay: -0.8s;
            }

            .lds-grid div:nth-child(8) {
                top: 45px;
                left: 26px;
                animation-delay: -1.2s;
            }

            .lds-grid div:nth-child(9) {
                top: 45px;
                left: 45px;
                animation-delay: -1.6s;
            }

            @keyframes lds-grid {
                0%,
                100% {
                    opacity: 1;
                }

                50% {
                    opacity: 0.5;
                }
            }
        </style>
<link as="script" href="https://iltacon2022.expofp.com/data/data.js" rel="preload"/>
<link as="script" href="https://iltacon2022.expofp.com/data/fp.svg.js" rel="preload"/>
<link as="script" href="https://iltacon2022.expofp.com/packages/master/floorplan.js" rel="preload"/>
<link as="script" href="https://iltacon2022.expofp.com/packages/master/vendors~floorplan.js" rel="preload"/>
<link as="style" href="https://iltacon2022.expofp.com/packages/master/vendor/fa/css/fontawesome-all.min.css" rel="preload"/>
<link as="style" href="https://iltacon2022.expofp.com/packages/master/vendor/sanitize-css/sanitize.css" rel="preload"/>
<link as="style" href="https://iltacon2022.expofp.com/packages/master/vendor/perfect-scrollbar/css/perfect-scrollbar.css" rel="preload"/>
<!-- Fonts are anonymous because those will be loaded with FontFace -->
<link as="font" crossorigin="anonymous" href="https://iltacon2022.expofp.com/packages/master/vendor/fa/webfonts/fa-regular-400.woff2" rel="preload"/>
<link as="font" crossorigin="anonymous" href="https://iltacon2022.expofp.com/packages/master/vendor/fa/webfonts/fa-solid-900.woff2" rel="preload"/>
<link as="font" crossorigin="anonymous" href="https://iltacon2022.expofp.com/packages/master/vendor/fa/webfonts/fa-light-300.woff2" rel="preload"/>
<link as="font" crossorigin="anonymous" href="https://iltacon2022.expofp.com/packages/master/fonts/oswald-v17-cyrillic_latin-500.woff2" rel="preload"/>
<link as="font" crossorigin="anonymous" href="https://iltacon2022.expofp.com/packages/master/fonts/oswald-v17-cyrillic_latin-300.woff2" rel="preload"/>
<script src="https://iltacon2022.expofp.com/data/data.js"></script><script src="https://iltacon2022.expofp.com/data/wf.data.js"></script><script src="https://iltacon2022.expofp.com/data/fp.svg.js"></script><script charset="utf-8" src="https://iltacon2022.expofp.com/packages/master/vendors~floorplan.js"></script><script charset="utf-8" src="https://iltacon2022.expofp.com/packages/master/floorplan.js"></script></head>
<body>
<noscript>You need to enable JavaScript to run this app.</noscript>
<div class="expofp-floorplan" data-event-id="iltacon2022"><div></div></div>
<script src="https://iltacon2022.expofp.com/packages/master/expofp.js"></script>
</body></html>

Solution

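  • Your task is not trivial. The alert text in your output is not an error: it sits inside a script that only fires when the user agent contains "Trident/" (Internet Explorer), so it shows up in the page source for every browser. The real difficulty is that the exhibitor list is rendered by JavaScript inside a shadow root, which driver.page_source does not include. Here is one possible solution: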

    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.support.ui import Select
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.common.action_chains import ActionChains
    from selenium.webdriver.common.keys import Keys
    import time as t
    import pandas as pd
    
    
    chrome_options = Options()
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument('disable-notifications')
    chrome_options.add_argument("window-size=1280,720")
    
    webdriver_service = Service("chromedriver/chromedriver") ## path to where you saved chromedriver binary
    browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)
    actions = ActionChains(browser)
    
    url = 'https://iltacon2022.expofp.com/'
    browser.get(url) 
    c_list = []
    # The floor plan app attaches a shadow root to this inner div.
    parent_el = WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.XPATH, '//div[@data-event-id="iltacon2022"]/div')))
    parent_el_shadow_root = parent_el.shadow_root
    t.sleep(5)  # give the app time to render the exhibitor list
    companies_div = parent_el_shadow_root.find_element(By.CSS_SELECTOR, 'div[class="overlay-content__scrollable ps ps--active-y"]')
    while True:
        try:
            # Re-read the rows on every pass and keep those that have text.
            companies = parent_el_shadow_root.find_elements(By.CSS_SELECTOR, "a[class = 'exhibitor-row list-row  ']")
            for c in companies:
                if len(c.text) > 3:
                    c_list.append((c.text.replace('\n', ': '), c.get_attribute('href')))
            print(f'we found {len(c_list)} companies')
            # c_list grows on every pass (repeats included), so this index
            # eventually runs past the end of companies; the resulting
            # exception is caught below and ends the loop.
            actions.move_to_element(companies[len(c_list)]).perform()
            print("moving to element", companies[len(c_list)].text.replace('\n', ': '))
            t.sleep(1)
            companies[len(c_list)].send_keys(Keys.PAGE_DOWN)
            print('scrolled page down')
            t.sleep(2)
        except Exception as e:
            print('all done')
            break
    # set() drops the rows that were collected more than once.
    df = pd.DataFrame(list(set(c_list)), columns = ['Company', 'Url'])
    df.to_csv('surveillance_capitalists.csv')
    print(df)
    

    It's important to use Chrome/chromedriver because of the way the shadow root is located in the code above. The setup shown is for Linux; if you already have a working Selenium/chromedriver setup on your machine, keep your own driver setup and just copy the imports and the code that follows the browser definition.

    The terminal output is quite verbose: it tells you what is going on at each step and ends by printing a dataframe of companies and their respective URLs, which is also saved to disk as a CSV file. You can then scrape those URLs; just make sure you inspect every page properly and locate the shadow root and the elements inside it. Selenium documentation can be found at https://www.selenium.dev/documentation/
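
When locating elements inside a shadow root, the fixed t.sleep(...) pauses in the code above could be replaced by polling until the elements actually appear. The helper below is only a sketch of that idea, not part of the original answer: wait_for_elements is a hypothetical name, and it assumes nothing beyond the find_elements(by, value) method that Selenium's WebDriver, WebElement and ShadowRoot objects expose.

```python
import time


def wait_for_elements(root, by, selector, timeout=10.0, poll=0.5):
    """Poll root.find_elements(by, selector) until it returns a non-empty list.

    `root` is any object with Selenium's find_elements(by, value) method,
    for example a WebDriver, a WebElement, or a ShadowRoot.
    Raises TimeoutError if nothing matches within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        found = root.find_elements(by, selector)
        if found:
            return found
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no element matched {selector!r} within {timeout}s")
        time.sleep(poll)
```

With the names from the code above, a call might look like wait_for_elements(parent_el_shadow_root, By.CSS_SELECTOR, "a[class = 'exhibitor-row list-row  ']"). Whether this fully replaces the sleeps on this particular site is untested; treat the timeout values as starting points.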

    For any questions, just comment here, or ask in the Selenium chat room, which I imagine is very helpful.
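
As an aside on the last step of the code above: the loop collects the same rows more than once, and list(set(c_list)) removes the repeats but does not keep the order in which the companies were found. If the order matters, something along these lines might help. It is a sketch using only the standard library; unique_rows, rows_to_csv and the sample rows are made up for illustration and do not come from the site.

```python
import csv
import io


def unique_rows(rows):
    """Drop duplicate (company, url) tuples while keeping first-seen order."""
    return list(dict.fromkeys(rows))


def rows_to_csv(rows, header=("Company", "Url")):
    """Serialize rows to CSV text; write the returned string to a file if needed."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample data standing in for the scraped c_list.
sample = [
    ("Acme: Booth 1", "https://example.com/acme"),
    ("Beta: Booth 2", "https://example.com/beta"),
    ("Acme: Booth 1", "https://example.com/acme"),
]
print(rows_to_csv(unique_rows(sample)))
```

The deduplicated list can also be passed straight to pd.DataFrame(..., columns=['Company', 'Url']) as in the original code; the CSV helper is only there to keep the sketch free of third-party dependencies.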