Search code examples
pythonseleniumxpathscreen-scraping

Cannot locate any parts of the body of an element with selenium for python


Life has been tough on me lately. I have tried to scrape a website (https://osf.io/preprints/discover?subject=bepress%7CSocial%20and%20Behavioral%20Sciences) with some of the HTML code below for a week now and have tried multiple things to get it to work. I need the div ID ember499 all the way at the bottom. This div is the one that contains the whole website, and if I cant access it, I cant scrape anything. There are 4 divs in the main body tag, MaxJax_Message, ember-bootstrap-wormhole, ember-basic-dropdown-wormhole and ember499 as seen below:


<body class="ember-application">
<div id="MathJax_Message" style="display: none;"></div>
    <noscript>
        <p>
        For full functionality of this site it is necessary to enable JavaScript.
        Here are
            <a href='https://www.enable-javascript.com/' target='_blank'> instructions for enabling JavaScript in your web browser</a>.
        </p>
    </noscript>

    <script> window.prerenderReady = false; </script>

    

    <script src="//cdnjs.cloudflare.com/ajax/libs/jquery/2.2.4/jquery.min.js"></script>
                    <script src="//cdnjs.cloudflare.com/ajax/libs/ember.js/2.18.0/ember.min.js"></script>

    <script>
        (function() {
            var encodedConfig = document.head.querySelector("meta[name$='/config/environment']").content;
            var config = JSON.parse(unescape(encodedConfig));
            var assetSuffix = config.ASSET_SUFFIX ? '-' + config.ASSET_SUFFIX : '';
            var origin = window.location.origin;
            window.isProviderDomain = !~config.OSF.url.indexOf(origin);
            var prefix = '/' + (window.isProviderDomain ? '' : 'preprints/') + 'assets/';
            [
                'vendor',
                'preprint-service'
            ].forEach(function (name) {
                var script = document.createElement('script');
                script.src = prefix + name + assetSuffix + '.js';
                script.async = false;
                document.body.appendChild(script);

                var link = document.createElement('link');
                link.rel = 'stylesheet';
                link.href = prefix + name + assetSuffix + '.css';
                document.head.appendChild(link);
            });
        })();
    </script><script src="/preprints/assets/vendor-f46d275519d6cf7078493fc4564ccd3c7dc419ed.js"></script><script src="/preprints/assets/preprint-service-f46d275519d6cf7078493fc4564ccd3c7dc419ed.js"></script>

    <script src="https://cdn.ravenjs.com/3.22.1/ember/raven.min.js"></script>
                    <script>
                        var encodedConfig = document.head.querySelector("meta[name$='/config/environment']").content;
                        var config = JSON.parse(unescape(encodedConfig));
                        if (config.sentryDSN) {
                            Raven.config(config.sentryDSN, config.sentryOptions || {}).install();
                        }
                    </script>

    <div id="ember-basic-dropdown-wormhole"></div>
<div id="ember-bootstrap-wormhole"></div>
  

<div id="ember499" class="ember-view">
<!---->

    <div id="ember538" class="__new-osf-navbar__b7554 ember-view"><div class="osf-nav-wrapper">
    <nav id="navbarScope" role="navigation" class="navbar navbar-inverse navbar-fixed-top">
        <div class="container">
            <div class="navbar-header">
                <a href="/" aria-label="Go home" class="navbar-brand">
                    <span class="osf-navbar-logo"></span>
                </a>

                <div class="service-name">
                    <a href="https://osf.io/preprints/">
                        <span class="hidden-xs"> OSF </span>

I have tried printing all of the divs that are contained in the main body for example:

wormhole = driver.find_element(By.CLASS_NAME, 'ember-application')
divs = wormhole.find_elements(By.TAG_NAME, 'div')

I have tried finding via XPATH, ID, more or less everything. When I print the ID of each div and append it into a list I get this: ['MathJax_Message', 'ember-basic-dropdown-wormhole', 'ember-bootstrap-wormhole'] Funnily enough, when I print len(divs) i get 3 back, but when I dont append them into a list it takes an extra 2-3 seconds to finish executing once it reaches div[3], this generally does not tend to happen with other sites:

OUTPUT

MathJax_Message
ember-basic-dropdown-wormhole
ember-bootstrap-wormhole

Process finished with exit code 0

I have tried scrolling to the middle of the page in case its hidden, finding out what is in each of the 3 divs above it, going directly to the class names that I want, finding all elements of the webpage using find_elements(By.XPATH, '//*'). They all either only return the same 3 divs mentioned above, or they say 'element not found'. I cant think of what else to do/try.

Please guide me Stack Gods.


Solution

  • You need to provide delay time.sleep()

    driver.get('https://osf.io/preprints/discover?subject=bepress%7CSocial%20and%20Behavioral%20Sciences')
    time.sleep(5)
    print(len(driver.find_elements(By.CSS_SELECTOR,".ember-application div")))
    for ele in driver.find_elements(By.CSS_SELECTOR,".ember-application div"):
        print(ele.text)
    

    Output:

    275
    
    
    
    OSF PREPRINTS
    Add a Preprint
    Search
    Support
    Donate
    Sign Up Sign In
    Preprint Archive Search
    powered by
    Search
    2,365,609 searchable as of November 01, 2022
    Partner Repositories
    Previous
    Next
    Sort by: Relevance
    Active Filters:
    Clear filters
    Social and Behavioral Sciences
    Refine your search by
    Providers
    OSF Preprints (50,302)
    AfricArXiv (394)
    AgriXiv (426)
    Arabixiv (328)
    arXiv (1,324,846)
    BioHackrXiv (29)
    bioRxiv (48,109)
    BodoArXiv (110)
    Cogprints (283)
    CoP (1)
    EarthArXiv (1,755)
    EcoEvoRxiv (940)
    ECSarXiv (241)
    EdArXiv (1,088)
    engrXiv (2,088)
    FocUS Archive (52)
    Frenxiv (148)
    INA-Rxiv (16,605)
    IndiaRxiv (148)
    LawArXiv (1,374)
    LIS Scholarship Archive (310)
    MarXiv (456)
    MediArXiv (201)
    MetaArXiv (457)
    MindRxiv (286)
    NutriXiv (84)
    PaleorXiv (219)
    PeerJ (4,747)
    Preprints.org (21,880)
    PsyArXiv (25,458)
    RePEc (848,335)
    SocArXiv (11,629)
    SportRxiv (386)
    Thesis Commons (1,857)
    Subject
    Architecture
    Arts and Humanities
    Business
    Education
    Engineering
    Law
    Life Sciences
    Medicine and Health Sciences
    Physical Sciences and Mathematics
    Social and Behavioral Sciences
    Do you want to add your own research as a preprint?
    Add a preprint
    The Corporate Social Responsibility is just a twist in a M\"obius Strip
    Solferino, NazariaSolferino, Viviana
    Last edited: Oct 13, 2015 UTC
    Finance Social and Behavioral Sciences Economics
    In recent years economics agents and systems have became more and more interacting and juxtaposed, therefore the social sciences need to rely on the studies of physical sciences to analyze this complexity in the relationships. According to this point of view we rely on the geometrical model of the M ...
    arXiv
    Some suggestions on dealing with measurement error in linkage analyses
    Marko BachlMichael Scharkow
    Last edited: Jul 2, 2018 UTC
    Social and Behavioral Sciences Communication
    Linkage analysis is a sophisticated media effect research design that reconstructs the likely exposure to relevant media messages of individual survey respondents by complementing the survey data with a content analysis. It is an important improvement over survey-only designs: Instead of predicting ...
    OSF Preprints
    Imagined Interdependence: Manipulating Discourse Changes How People Construe Interdependence
    Jiří Münich..so on