python html xpath wikipedia movie

xpath - How do I get a node that may or may not contain a parent node


I'm currently building a Python script that will pull all Oscar-nominated Best Picture movies from the Wikipedia page. I've made two different lists, one for the winners and one for the nominees.

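from lxml import html
import requests

# Wikipedia serves HTML, so use lxml's HTML parser rather than the strict XML one
r = requests.get('https://en.wikipedia.org/wiki/Academy_Award_for_Best_Picture')
doc = html.fromstring(r.content)
# Winner rows carry the highlight style and wrap the link in <b>
winners = doc.xpath('//tr[@style="background:#FAEB86"]/td/i/b/a')
# Nominee links sit directly inside <i>, with no <b> wrapper
nominees = doc.xpath('//tr/td/i/a')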

As you can see, I'm focusing on the last node, as that has the name of the movie. I am able to get all the movies for each list, but I want to put them in one list together using XPath. I know I could merge the two lists, but the movies have to be in the order they appear on the wiki page.

The main problem comes from the nodes with @style and /b, which both come before /a. I tried putting them together in a single line:

winners = doc.xpath('//tr[@style="background:#FAEB86" or not(@style="background:#FAEB86")]/td/i[b or not(b)]/a')

but I only get the most recent winner (Moonlight) at the beginning of the list, and the rest of the list is just the nominated movies.

Is it possible to put my two lists together in a single statement, or will I have to write a workaround that puts the movies in the correct order?


Solution

  • I would do it like this:

    //table[@class="wikitable"]//tr/td[1][not(@rowspan)]//a
    
    • //table[@class="wikitable"] selects only the tables with the films.
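    • //tr/td[1][not(@rowspan)] selects the first cell of each row, excluding the tall cells (those with a rowspan attribute) that hold only the year.
    • //a selects every link inside those cells, whether or not it is wrapped in b.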
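
To see why this keeps winners and nominees in page order, here is a minimal sketch that runs the expression against a small, invented piece of markup. The SAMPLE string is an assumption made up for illustration (placeholder hrefs, placeholder producer names, the year cell alone in its own row); the real Wikipedia page is larger and may be structured differently.

```python
from lxml import html

# Invented, simplified markup for illustration only; not the real page.
SAMPLE = """
<html><body>
<table class="wikitable">
  <tr><td rowspan="3"><a href="#y2016">2016</a></td></tr>
  <tr style="background:#FAEB86">
    <td><i><b><a href="#moonlight">Moonlight</a></b></i></td>
    <td><a href="#p1">Producer A</a></td>
  </tr>
  <tr>
    <td><i><a href="#arrival">Arrival</a></i></td>
    <td><a href="#p2">Producer B</a></td>
  </tr>
  <tr><td rowspan="3"><a href="#y2015">2015</a></td></tr>
  <tr style="background:#FAEB86">
    <td><i><b><a href="#spotlight">Spotlight</a></b></i></td>
    <td><a href="#p3">Producer C</a></td>
  </tr>
  <tr>
    <td><i><a href="#martian">The Martian</a></i></td>
    <td><a href="#p4">Producer D</a></td>
  </tr>
</table>
<table class="infobox">
  <tr><td><a href="#other">Unrelated link</a></td></tr>
</table>
</body></html>
"""

doc = html.fromstring(SAMPLE)

# First cell of every row, skipping the rowspan year cells, then any link
# below it. XPath node-sets come back in document order, so winners and
# nominees stay interleaved exactly as they appear in the table.
links = doc.xpath('//table[@class="wikitable"]//tr/td[1][not(@rowspan)]//a')
titles = [a.text_content() for a in links]

print(titles)  # ['Moonlight', 'Arrival', 'Spotlight', 'The Martian']
```

The producer links in the second column and the link in the unrelated table are skipped, and the year links are dropped by not(@rowspan). Note that @class="wikitable" is an exact match; if the real tables carry extra classes, something like contains(@class, "wikitable") would probably be needed instead.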