Tags: python, html, xpath, lxml, python-requests-html

Problem with lxml.xpath not putting elements into a list


So here's my problem: I'm trying to use lxml to scrape a website and pull out some information, but the elements that hold that information aren't being found by page.xpath(). The page itself is retrieved fine; the XPath query just returns nothing.

import requests
from lxml import html

def main():
    result = requests.get('https://rocketleague.tracker.network/rocket-league/profile/xbl/ReedyOrange/overview')

    # the root of the tracker website
    page = html.fromstring(result.content)
    print('its getting the element from here', page)

    threesRank = page.xpath('//*[@id="app"]/div[2]/div[2]/div/main/div[2]/div[3]/div[1]/div/div/div[1]/div[2]/table/tbody/tr[*]/td[3]/div/div[2]/div[1]/div')
    print('the 3s rank is: ', threesRank)

if __name__ == "__main__":
    main()

OUTPUT:
"D:\Python projects\venv\Scripts\python.exe" "D:/Python projects/main.py"

its getting the element from here <Element html at 0x20eb01006d0>
the 3s rank is:  []

Process finished with exit code 0

The output next to "the 3s rank is:" should look something like this

[<Element html at 0x20eb01006d0>, <Element html at 0x20eb01006d0>, <Element html at 0x20eb01006d0>]



Solution

  • Because the XPath string does not match anything, page.xpath(..) returns an empty result set. It's difficult to say exactly what you are looking for, but judging by the name "threesRank" I assume you want all the table values, i.e. the ranking and so on.
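    Note that lxml does not raise an error when an XPath expression matches nothing; it simply returns an empty list, which is exactly what you are seeing. A minimal illustration:

    from lxml import html

    tree = html.fromstring("<html><body><p>hello</p></body></html>")
    print(tree.xpath('//p/text()'))  # ['hello'] -> the expression matches
    print(tree.xpath('//table'))     # []        -> no match: no error, just an empty list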

    You can get a more accurate and self-explanatory XPath using the Chrome extension "XPath Helper". Usage: open the site and activate the extension, then hold down the shift key and hover over the element you are interested in.

    Since the HTML used by tracker.network.com is built dynamically by JavaScript with BootstrapVue (plus Moment/Typeahead/jQuery), there is a real risk that the dynamic rendering produces different results from time to time.
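    You can check this yourself: the static HTML that requests receives does not contain the rendered ranking table at all. A quick sketch (the printed count is what I would expect from a client-side rendered page, not a guarantee):

    import requests
    from lxml import html

    result = requests.get('https://rocketleague.tracker.network/rocket-league/profile/xbl/ReedyOrange/overview')
    page = html.fromstring(result.content)

    # The ranking table is rendered in the browser by JavaScript, so the raw
    # server response most likely contains no <table> elements at all.
    print(len(page.xpath('//table')))  # most likely prints 0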

    Instead of scraping the rendered HTML, I suggest you use the structured data the page is rendered from, which in this case is stored as JSON in a JavaScript variable called __INITIAL_STATE__.

    import requests
    import re
    import json
    from contextlib import suppress
    
    # get page
    result = requests.get('https://rocketleague.tracker.network/rocket-league/profile/xbl/ReedyOrange/overview')
    
    # Extract everything needed to render the current page. Data is stored as Json in the
    # JavaScript variable: window.__INITIAL_STATE__={"route":{"path":"\u0 ... }};
    json_string = re.search(r"window\.__INITIAL_STATE__\s*=\s*(\{.*?\});", result.text).group(1)
    
    # convert text string to structured json data
    rocketleague = json.loads(json_string)
    
    # Save structured json data to a text file that helps you orient yourself and pick
    # the parts you are interested in.
    with open('rocketleague_json_data.txt', 'w') as outfile:
        outfile.write(json.dumps(rocketleague, indent=4, sort_keys=True))
    
    # Access members using names
    print(rocketleague['titles']['currentTitle']['platforms'][0]['name'])
    
    # To avoid exceptions when a key is missing or an index is out of range, use "with suppress"
    # as in the example below: since there is no platform number 99, the variable "platform99"
    # simply stays unassigned instead of an exception being raised. Note that a missing dict key
    # raises KeyError while an out-of-range list index raises IndexError, so suppress both.
    with suppress(KeyError, IndexError):
        platform1 = rocketleague['titles']['currentTitle']['platforms'][0]['name']
        platform99 = rocketleague['titles']['currentTitle']['platforms'][99]['name']
    
    # print platforms used by currentTitle
    for platform in rocketleague['titles']['currentTitle']['platforms']:
        print(platform['name'])
    
    # print all titles with corresponding platforms
    for title in rocketleague['titles']['titles']:
        print(f"\nTitle: {title['name']}")
        for platform in title['platforms']:
            print(f"\tPlatform: {platform['name']}")