Tags: javascript, web-scraping, casperjs, text-extraction, html-content-extraction

Extract list of texts with CasperJS


I want to extract the text values from this list:

<ul class="standardSuggestions">
    <li class="">

        <div id="idac">
            <span class="email" id="idb7"><span>mail-fuer-chrisko</span>@<span>web.de</span></span>
            <span class="btn-positioner"><span class="btn-wrapper btn-fix btn-service btn-xs"><input name="wishnamePanel:suggestionsContainerWrapper:freeMailSuggestionsPanel:standard-suggestion-list:suggestionRepeaterContainer:suggestion-to-repeat:1:suggestion:subForm:select-email" id="idae" value="Übernehmen" type="submit"></span></span>
        </div>

    </li><li class="">

        <div id="idaf">
            <span class="email" id="idb8"><span>post-fuer-chrisko</span>@<span>web.de</span></span>
            <span class="btn-positioner"><span class="btn-wrapper btn-fix btn-service btn-xs"><input name="wishnamePanel:suggestionsContainerWrapper:freeMailSuggestionsPanel:standard-suggestion-list:suggestionRepeaterContainer:suggestion-to-repeat:2:suggestion:subForm:select-email" id="idb0" value="Übernehmen" type="submit"></span></span>
        </div>

    </li><li class="">

        <div id="idb1">
            <span class="email" id="idb9"><span>chrisko1</span>@<span>web.de</span></span>
            <span class="btn-positioner"><span class="btn-wrapper btn-fix btn-service btn-xs"><input name="wishnamePanel:suggestionsContainerWrapper:freeMailSuggestionsPanel:standard-suggestion-list:suggestionRepeaterContainer:suggestion-to-repeat:3:suggestion:subForm:select-email" id="idb2" value="Übernehmen" type="submit"></span></span>
        </div>

    </li><li class="">

        <div id="idb3">
            <span class="email" id="idba"><span>chrisko.1</span>@<span>web.de</span></span>
            <span class="btn-positioner"><span class="btn-wrapper btn-fix btn-service btn-xs"><input name="wishnamePanel:suggestionsContainerWrapper:freeMailSuggestionsPanel:standard-suggestion-list:suggestionRepeaterContainer:suggestion-to-repeat:4:suggestion:subForm:select-email" id="idb4" value="Übernehmen" type="submit"></span></span>
        </div>

    </li>
</ul>

The problem is that the div ids change on every reload, so I'm not sure how to select the correct elements. I tried the following function:

casper.then(function(){
    var listItems = this.evaluate(function () {
        var nodes = document.querySelectorAll('ul > li');
        return [].map.call(nodes, function(node) {
            return {
                text: node.querySelector("span").textContent
            };
        });
    });
    this.echo(JSON.stringify(listItems, undefined, 4)); 
});

But the echoed output is "null" :-(


Solution

  • Your iteration over the elements is correct. The only way to get a null value out of the page context is if an error occurred there. The only part of the code that can produce an error is node.querySelector("span").textContent, because a node does not necessarily have a <span> descendant. If it has none, querySelector returns null, accessing .textContent on it throws a TypeError, and evaluate() gives you null.

    The limited markup that you've shown contains a <span> in every <li>, so there must be another <ul> on the page whose <li> children have no <span> descendants. You have to find a CSS selector that doesn't match that other <ul> element.

    I propose

    var nodes = document.querySelectorAll('ul.standardSuggestions > li');
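Putting the narrower selector together with a guard against a missing <span> could look like the sketch below. The helper name extractEmails and its optional root parameter are made up for illustration and are not part of CasperJS. The span.email selector is taken from the markup in the question, on the assumption that the real page matches it.

```javascript
// Hypothetical helper: collects the text of every suggestion in the list.
// "root" is optional; when omitted (as inside casper.evaluate) the page's
// global "document" is used. Written in ES5 so PhantomJS can run it.
function extractEmails(root) {
    root = root || document;
    // Only look at <li> children of the suggestions list, not every <ul>.
    var nodes = root.querySelectorAll('ul.standardSuggestions > li');
    return [].map.call(nodes, function (node) {
        var span = node.querySelector('span.email');
        // A missing <span> yields null instead of throwing a TypeError.
        return { text: span ? span.textContent.trim() : null };
    });
}

// Intended use inside the CasperJS script (commented out so the helper
// can also be loaded on its own):
//
// casper.then(function () {
//     var listItems = this.evaluate(extractEmails);
//     this.echo(JSON.stringify(listItems, undefined, 4));
// });
```

With the guard in place, a list item without the expected <span> shows up as an entry with text set to null rather than turning the whole result into null, which makes it easier to see which element is the odd one out.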