Tags: python, xpath, web-scraping, lxml, boolean-operations

Using Boolean value to execute different XPath expressions with Python lxml


I am trying to scrape weather data from a website using a Python script and lxml. Wind speed data will be pulled and appended to a list for later manipulation. I am able to get the information I need just fine when it is formatted like this:

<div class = "day-fcst">
  <div class = "wind">
    <div class = "gust">
      "Gusts to 20-30mph"
    </div>
  </div>
</div>

However, when winds are low, the website adds a child span with its own class under the "gust" div, like so:

<div class = "gust">
  <span class = "nowind">
    "Gusts less than 20mph"
  </span>
</div>

My thought process was to check whether the span exists: if it does, execute an XPath expression to pull the text under the span; otherwise, execute an XPath expression that pulls the text directly under the "gust" div. I tried searching for examples of using XPath Boolean functions, but was unable to make anything work (either in Safari's Web Inspector or in my script).

My current code uses Python to check whether the span class is equal to "nowind" and then runs an if/else statement, but only the else branch is ever executed. My current code looks like this:

from lxml import html
import requests

wind = []

source=requests.get('website')
tree = html.fromstring(source.content)

if tree.xpath('//div[@class = "day-fcst"]/div[@class = "wind"]/div[@class = "gust"]/span/@class') == 'nowind':
  wind.append(tree.xpath('//div[@class = "day-fcst"]/div[@class = "wind"]/div[@class = "gust"]/span/text()'))
else:
  wind.append(tree.xpath('//div[@class = "day-fcst"]/div[@class = "wind"]/div[@class = "gust"]/text()'))

print wind

I'd like to solve this with an XPath expression that results in a Boolean value as opposed to my current workaround. Any help would be appreciated. I am still new to using XPath, so I am unfamiliar with utilizing any of its functions.


Solution

  • It's possible to use the same XPath expression for both cases. Just use //div[@class = "day-fcst"]/div[@class = "wind"]/div[@class = "gust"]//text()
    The trailing //text() selects text nodes at any depth under the "gust" div, so it matches whether or not the span is present.

    Alternatively, you could get the <div class = "gust"> element and then use the text_content() method to get its text content.

    In [1]: from lxml import html
    
    In [2]: first_html = '<div class = "day-fcst"><div class = "wind"><div class = "gust">"Gusts to 20-30mph"</div></div></div>'
    
    In [3]: second_html = '<div class = "day-fcst"><div class = "wind"><div class = "gust"><span class = "nowind">"Gusts to 20-30mph"</span></div></div></div>'
    
    In [4]: f = html.fromstring(first_html)
    
    In [5]: s = html.fromstring(second_html)
    
    In [6]: f.xpath('//div[@class = "day-fcst"]/div[@class = "wind"]/div[@class = "gust"]')[0].text_content()
    Out[6]: '"Gusts to 20-30mph"'
    
    In [7]: s.xpath('//div[@class = "day-fcst"]/div[@class = "wind"]/div[@class = "gust"]')[0].text_content()
    Out[7]: '"Gusts to 20-30mph"'
    
    In [8]: print(f.xpath('//div[@class = "day-fcst"]/div[@class = "wind"]/div[@class = "gust"]//text()'))
    ['"Gusts to 20-30mph"']
    
    In [9]: print(s.xpath('//div[@class = "day-fcst"]/div[@class = "wind"]/div[@class = "gust"]//text()'))
    ['"Gusts to 20-30mph"']
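
Since the question also asks for an XPath expression that yields a Boolean, here is a minimal sketch of that route, offered as one possible approach rather than the answerer's method. It relies on lxml returning a Python bool for XPath's boolean() function and a string for normalize-space(). The names GUST, has_nowind and gust_text are made up for illustration, and the sample markup is adapted from the question (without the literal quote marks). It also shows why the original if never fired: xpath() returns a list, and a list never compares equal to a string.

```python
from lxml import html

# Sample markup adapted from the question: the two shapes of the "gust" div.
PLAIN = ('<div class="day-fcst"><div class="wind">'
         '<div class="gust">Gusts to 20-30mph</div></div></div>')
NOWIND = ('<div class="day-fcst"><div class="wind"><div class="gust">'
          '<span class="nowind">Gusts less than 20mph</span></div></div></div>')

# Hypothetical helper constant: the path to the "gust" div used throughout.
GUST = '//div[@class="day-fcst"]/div[@class="wind"]/div[@class="gust"]'


def has_nowind(tree):
    # boolean(node-set) is true when the node-set is non-empty;
    # lxml hands the result back as a Python bool.
    return tree.xpath('boolean(' + GUST + '/span[@class="nowind"])')


def gust_text(tree):
    # normalize-space() takes the string value of the first matching div
    # (all descendant text, span included) and trims surrounding whitespace.
    return tree.xpath('normalize-space(' + GUST + ')')


plain = html.fromstring(PLAIN)
nowind = html.fromstring(NOWIND)

# Why the original comparison failed: xpath() returns a list of matches.
classes = nowind.xpath(GUST + '/span/@class')
print(classes == 'nowind')    # list compared with a string
print(classes == ['nowind'])  # list compared with a list

print(has_nowind(plain), has_nowind(nowind))
print(gust_text(plain))
print(gust_text(nowind))
```

With this shape, the Boolean is only needed if the two cases must be handled differently; gust_text() alone returns the wind text for both layouts.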