css ruby xpath nokogiri open-uri

Nokogiri returning empty array


I'm screen scraping http://www.weather.com/weather/hourbyhour/l/INXX0202:1:IN.

I tried selecting using both CSS and XPath to get the precipitation forecast part of the table in the website.

Neither works in my program: both return empty arrays. However, both work in Chrome DevTools (Inspect element -> Console -> $$ for CSS, $x for XPath).

Why is this happening? Does it have something to do with namespaces?

require 'open-uri'
require 'nokogiri'
foo = Nokogiri::HTML(open("http://www.weather.com/weather/hourbyhour/l/INXX0202:1:IN"))
foo.remove_namespaces!
p foo.xpath("//section[@data-ng-class]/p[@class='precip weather-cell ng-isolate-scope']/span[@data-ng-if]") # returns []
p foo.css("section[data-ng-class] p[class='precip weather-cell ng-isolate-scope'] span[data-ng-if]")  # returns []

Here is a screenshot of the website that I'm trying to get data from. What I want are the numbers under the heading "Precip" (e.g., 85, 100, 100, 95, 80, 70, 45, 40 in the picture).

I copied the page's HTML into a local HTML file and had my program read that file. The program then gave me the output I needed, but when the same program accesses the website using OpenURI, it returns an empty array:

require 'open-uri'
require 'nokogiri'
foo = open("http://www.weather.com/weather/hourbyhour/l/INXX0202:1:IN")
nokogirifoo = Nokogiri::HTML(foo)
p nokogirifoo.xpath("//section[@data-ng-class]/p[@class='precip weather-cell ng-isolate-scope']/span[@data-ng-if]") # => empty array

bar = File.open('weather.html') # weather.html is just the html code of the page copied into a local file
nokogiribar = Nokogiri::HTML(bar)
p nokogiribar.xpath("//section[@data-ng-class]/p[@class='precip weather-cell ng-isolate-scope']/span[@data-ng-if]").text # => "85%100%100%95%80%70%45%40%" (this is what I need)

Here is a snippet of the HTML (the part shown is nested within multiple tags in the website):

 <section class="wxcard-hourly summary-view ng-isolate-scope last" data-ng-class="{'last': $last}" data-wxcard-hourly="hour" data-wxcard-hourly-methods="hourlyScope" data-hours-index="hoursDataIndex" data-show-wx-labels="false" data-details-view="false">
    <div class="heading weather-cell" data-ng-switch="dataMethods.checkTime(data.getForecastLocalDate())">
        <h2>

      <span class="wx-dsxdate ng-binding ng-scope" ng-bind-template=" 9:30 am" data-dsxdate="" data-ng-switch-when="min" data-datetime="data.getForecastLocalDate()" data-timezone="locTz" data-format="'h:mm a'"> 9:30 am</span>
        </h2>
    <span class="sub-heading wx-hourly-date wx-dsxdate ng-binding ng-scope" ng-bind-template=" Fri, Nov 20" data-dsxdate="" data-datetime="data.getForecastLocalDate()" data-timezone="locTz" data-format="'EEE, MMM d'"> Fri, Nov 20</span>
    </div>
    <p class="hi-temp temp-1 weather-cell ng-isolate-scope" data-wx-temperature="data.getTemp()" data-show-temp-unit="hoursIndex === 0"> <span data-ng-if="hasValue()" data-ng-bind="temp" class="ng-binding ng-scope">28</span><sup data-ng-if="hasValue()" class="deg ng-scope">°</sup><sup class="temp-unit ng-binding ng-scope" data-ng-if="showTempUnit" data-ng-bind="tempUnit()">C</sup>
</p>
    <p class="feels-like temp-2 weather-cell ng-isolate-scope" data-wx-temperature="data.getFeelsLike()" data-temp-prefix="Feels"><span ng-if="tempPrefix" class="temp-prefix ng-binding ng-scope" data-ng-bind="tempPrefix">Feels</span><span data-ng-if="hasValue()" data-ng-bind="temp" class="ng-binding ng-scope">34</span><sup data-ng-if="hasValue()" class="deg ng-scope">°</sup>
</p>
    <div class="weather-cell">
        <h3 class="weather-phrase">
            <div class="weather-icon ng-isolate-scope wx-weather-icon" data-wxicon="" data-sky-code="data.getSkyCode()"><div class="svg-icon"><img src="/sites/all/modules/custom/angularmods/app/shared/wxicon/svgz/thunderstorm.svgz?1" aria-hidden="true" alt="thunderstorm"></div></div>

            <span class="phrase ng-binding" data-ng-bind-template="Thunderstorms">Thunderstorms</span>
        </h3>
    </div>
    <!-- The Next Line Is What I Need-->
    <p class="precip weather-cell ng-isolate-scope" data-wx-precip="dataMethods.roundedValue(data.getChanceOfPrecipDay())" data-wx-precip-type="data.getPrecipType()" data-wx-precip-sky-code="data.getSkyCode()"><span aria-hidden="true" class="wx-iconfont-global wx-icon-precip-rain-1"></span><span data-ng-if="!wxPrecipIconOnly" class="precip-val ng-binding ng-scope" data-ng-bind="chanceOfPrecip() | safeDisplay">85%</span></p>

    <p class="humidity-wrapper weather-cell">
      <span data-ng-bind-template="85%" class="humidity ng-binding ng-isolate-scope" data-wx-percentage="data.getHumidity()">85%</span>
    </p>

    <p class="wind-conditions weather-cell">
        <span class="wx-wind ng-binding ng-isolate-scope" data-ng-bind-template="ESE 9 km/h" data-wx-wind-direction="data.getWindDirectionText()" data-wx-wind-speed="data.getWindSpeed()">ESE 9 km/h</span>
    </p>
</section>

Solution

  • The problem is that you're looking at the page in a browser, which, in addition to an HTML parser, has an embedded JavaScript interpreter. Browsers find and act on any JavaScript in <script> tags, loading and adjusting elements before rendering the page for the user. That's what is happening in the page you want. Parsers like Nokogiri are NOT browsers and don't run embedded scripts: to a parser, a script is merely text inside a particular tag. As a result, the secondary HTML you want is never retrieved.

    You said you saved the HTML to a file, but you didn't say how you saved it. Because the saved HTML contains the information you want, I'm guessing it was saved using the browser.

    When working with web pages, the very first step is to determine whether the page uses dynamic HTML and/or JavaScript or is static HTML. Turn off JavaScript in your browser, and load the URL. Or, you can use wget or curl from the command-line to retrieve the page and look at it with an editor. In either case, do you see the content you want? If so, then odds are good you can get at it with a parser like Nokogiri after it's been retrieved. If you don't, then you have to use something that can interpret the JavaScript, process the loaded information, and then pass it to a parser.

    Tools like PhantomJS and Watir can help you. Alternatively, find a weather service that offers an API, so you can retrieve the data without scraping, which is always very fragile.

    It's also possible to figure out what URL the JavaScript is using to retrieve the data, then request that secondary resource and parse it. It might be HTML, or it might be JSON containing the data, which the JavaScript then processes to build the entire table on the fly.
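    If that secondary resource turns out to be JSON, parsing it needs only the standard library. The sketch below is an assumption throughout: the payload shape, the "hours" and "precip" field names, and the precip_values helper are invented for illustration, and the real names would have to be read from the request shown in the browser's network tab.

```ruby
require 'json'

# Hypothetical response body; the real endpoint's structure will differ.
body = <<~PAYLOAD
  {"hours": [{"time": "9:30 am",  "precip": 85},
             {"time": "10:30 am", "precip": 100}]}
PAYLOAD

# Pull the precipitation chance out of each hourly record.
# fetch raises KeyError if the payload does not have the expected keys,
# which makes a change in the remote format fail loudly.
def precip_values(json_text)
  JSON.parse(json_text).fetch('hours').map { |hour| hour.fetch('precip') }
end

p precip_values(body) # => [85, 100]
```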

    There are many questions and answers on Stack Overflow discussing how to do all the above.

    That all said, once you do get the HTML you want, you can easily reduce the CSS selector needed for those values. Each value is wrapped in a <span> tag which has a class, so use that class to find the value.

    require 'nokogiri'
    doc = Nokogiri::HTML(<<EOT)
    
        <section class="wxcard-hourly summary-view ng-isolate-scope last" data-ng-class="{'last': $last}" data-wxcard-hourly="hour" data-wxcard-hourly-methods="hourlyScope" data-hours-index="hoursDataIndex" data-show-wx-labels="false" data-details-view="false">
            <div class="heading weather-cell" data-ng-switch="dataMethods.checkTime(data.getForecastLocalDate())">
                <h2>
    
              <span class="wx-dsxdate ng-binding ng-scope" ng-bind-template=" 9:30 am" data-dsxdate="" data-ng-switch-when="min" data-datetime="data.getForecastLocalDate()" data-timezone="locTz" data-format="'h:mm a'"> 9:30 am</span>
                </h2>
            <span class="sub-heading wx-hourly-date wx-dsxdate ng-binding ng-scope" ng-bind-template=" Fri, Nov 20" data-dsxdate="" data-datetime="data.getForecastLocalDate()" data-timezone="locTz" data-format="'EEE, MMM d'"> Fri, Nov 20</span>
            </div>
            <p class="hi-temp temp-1 weather-cell ng-isolate-scope" data-wx-temperature="data.getTemp()" data-show-temp-unit="hoursIndex === 0"> <span data-ng-if="hasValue()" data-ng-bind="temp" class="ng-binding ng-scope">28</span><sup data-ng-if="hasValue()" class="deg ng-scope">°</sup><sup class="temp-unit ng-binding ng-scope" data-ng-if="showTempUnit" data-ng-bind="tempUnit()">C</sup>
        </p>
            <p class="feels-like temp-2 weather-cell ng-isolate-scope" data-wx-temperature="data.getFeelsLike()" data-temp-prefix="Feels"><span ng-if="tempPrefix" class="temp-prefix ng-binding ng-scope" data-ng-bind="tempPrefix">Feels</span><span data-ng-if="hasValue()" data-ng-bind="temp" class="ng-binding ng-scope">34</span><sup data-ng-if="hasValue()" class="deg ng-scope">°</sup>
        </p>
            <div class="weather-cell">
                <h3 class="weather-phrase">
                    <div class="weather-icon ng-isolate-scope wx-weather-icon" data-wxicon="" data-sky-code="data.getSkyCode()"><div class="svg-icon"><img src="/sites/all/modules/custom/angularmods/app/shared/wxicon/svgz/thunderstorm.svgz?1" aria-hidden="true" alt="thunderstorm"></div></div>
    
                    <span class="phrase ng-binding" data-ng-bind-template="Thunderstorms">Thunderstorms</span>
                </h3>
            </div>
            <!-- The Next Line Is What I Need-->
            <p class="precip weather-cell ng-isolate-scope" data-wx-precip="dataMethods.roundedValue(data.getChanceOfPrecipDay())" data-wx-precip-type="data.getPrecipType()" data-wx-precip-sky-code="data.getSkyCode()"><span aria-hidden="true" class="wx-iconfont-global wx-icon-precip-rain-1"></span><span data-ng-if="!wxPrecipIconOnly" class="precip-val ng-binding ng-scope" data-ng-bind="chanceOfPrecip() | safeDisplay">85%</span></p>
    
            <p class="humidity-wrapper weather-cell">
              <span data-ng-bind-template="85%" class="humidity ng-binding ng-isolate-scope" data-wx-percentage="data.getHumidity()">85%</span>
            </p>
    
            <p class="wind-conditions weather-cell">
                <span class="wx-wind ng-binding ng-isolate-scope" data-ng-bind-template="ESE 9 km/h" data-wx-wind-direction="data.getWindDirectionText()" data-wx-wind-speed="data.getWindSpeed()">ESE 9 km/h</span>
            </p>
        </section>
    EOT
    

    Starting with a simple search:

    doc.at('.precip-val').text # => "85%"
    

    at finds the first matching Node and returns it. text returns that node's text content.

    You want multiple nodes with that class, so something like this should help:

    doc.search('.precip-val').map(&:text) # => ["85%"]
    

    search finds all matching nodes and returns a NodeSet, which is like an array and can be iterated using map.

    It's unlikely they'll use .precip-val for non-precipitation tags wrapping values, but, if they did, try:

    doc.search('span.precip-val').map(&:text)
    

    and see what you get.