Regex works in pythex but doesn't work in Python


I am doing an assignment where I need to scrape information from live sites.

For this I am using https://www.nintendo.com/games/nintendo-switch-bestsellers, and I need to scrape the game titles, the prices and the image sources. I have the titles working, but the prices and image sources just return empty lists, even though the same patterns return the right matches in pythex.

Here is my code:

from re import findall, finditer, MULTILINE, DOTALL
from urllib.request import urlopen

game_html_source = urlopen\
('https://www.nintendo.com/games/nintendo-switch-bestsellers').\
read().decode("UTF-8")

# game titles - working
game_title = findall(r'<h3 class="b3">([A-Z a-z:0-9]+)</h3>', game_html_source)
print(game_title)

# game prices - returning empty list
game_prices = findall(r'<p class="b3 row-price">(\$[.0-9]+)</p>', game_html_source)
print(game_prices)

# game images - returning empty list
game_images = findall(r'<img alt="[A-Z a-z:]+" src=("https://media.nintendo.com/nintendo/bin/[A-Za-z0-9-\/_]+.png")>',game_html_source)
print(game_images)

Solution

  • Parsing HTML with regular expressions has too many pitfalls to be reliable: attribute order, quoting style and whitespace can all vary without changing what a tag means. BeautifulSoup and other HTML parsers work by building a complete document data structure, which you then navigate to extract the interesting bits. That approach is thorough, but erroneous HTML anywhere in the source, even in a part you don't care about, can defeat the parsing process. Pyparsing takes a middle approach: you define mini-parsers that match just the bits you want and skip over everything else, which also simplifies the post-parsing navigation. To cope with the variability in HTML styles, pyparsing provides a function, makeHTMLTags, which returns a pair of pyparsing expressions for the opening and closing tags:

    foo_start, foo_end = pp.makeHTMLTags('foo')
    

    foo_start will match:

    <foo>
    <foo/>
    <foo class='bar'>
    <foo href=something_not_in_quotes>
    

    and many more variations of attributes and whitespace.
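
    As a quick check of those variations, the sketch below runs foo_start over the sample tags listed above, plus one extra upper-case tag with loose spacing. That last sample is an illustrative assumption, not something from the original answer, and the code assumes pyparsing is installed.

```python
import pyparsing as pp

foo_start, foo_end = pp.makeHTMLTags("foo")

# The four tags listed above, plus a hypothetical upper-case variant
# with extra whitespace around the attribute.
samples = [
    "<foo>",
    "<foo/>",
    "<foo class='bar'>",
    "<foo href=something_not_in_quotes>",
    '<FOO  CLASS = "bar" >',
]

for tag in samples:
    parsed = foo_start.parseString(tag)
    # .get() returns None for attributes the tag does not carry
    print(tag, "->", "class:", parsed.get("class"), "href:", parsed.get("href"))
```

    Every sample parses without error, and attribute names are lower-cased, so the same key works whatever the capitalisation in the source.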

    Parsing with foo_start (as with any pyparsing expression) returns a ParseResults object, which makes it easy to access the parts of the parsed tag by name:

    foo_data = foo_start.parseString("<foo img='bar.jpg'>")
    print(foo_data.img)
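
    A short sketch of the access styles ParseResults supports, reusing the foo tag from above. The alt attribute is a made-up name, used only to show what happens when an attribute is absent. Attribute-style access returns an empty string for a missing name, while item-style access raises KeyError, so conditions on optional attributes are safer with attribute access or get().

```python
import pyparsing as pp

foo_start, _ = pp.makeHTMLTags("foo")
foo_data = foo_start.parseString("<foo img='bar.jpg'>")

print(foo_data.img)            # attribute-style access
print(foo_data["img"])         # item-style access
print("img" in foo_data)       # membership test on result names

# 'alt' is a hypothetical attribute that this tag does not carry
print(repr(foo_data.alt))      # attribute access falls back to ''
print(foo_data.get("alt"))     # get() falls back to None
try:
    foo_data["alt"]
except KeyError:
    print("item access raises KeyError for a missing attribute")
```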
    

    For your Nintendo page scraper, see the annotated source below:

    import pyparsing as pp
    
    # define expressions to match opening and closing tags <h3>
    h3, h3_end = pp.makeHTMLTags("h3")
    
    # define a specific type of <h3> tag that has the desired 'class' attribute
    # (t.get() returns None instead of raising KeyError when a tag has no class)
    h3_b3 = h3().addCondition(lambda t: t.get('class') == "b3")
    
    # similar for <p>
    p, p_end = pp.makeHTMLTags("p")
    p_b3_row_price = p().addCondition(lambda t: t.get('class') == "b3 row-price")
    
    # similar for <img>
    img, _ = pp.makeHTMLTags("img")
    img_expr = img().addCondition(lambda t: t.src.startswith("//media.nintendo.com/nintendo/bin"))
    
    # define expressions to capture tag body for title and price - include negative lookahead for '<' so that
    # tags with embedded tags are not matched
    LT = pp.Literal('<')
    title_expr = h3_b3 + ~LT + pp.SkipTo(h3_end)('title') + h3_end
    price_expr = p_b3_row_price + ~LT + pp.SkipTo(p_end)('price') + p_end
    
    # compose a scanner expression by '|'ing the 3 sub-expressions into one
    scanner = title_expr | price_expr | img_expr
    
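    # read web page into variable 'html' (same request as in the question)
    from urllib.request import urlopen
    html = urlopen('https://www.nintendo.com/games/nintendo-switch-bestsellers').read().decode("UTF-8")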
    
    # use searchString to search through the retrieved HTML for matches
    for match in scanner.searchString(html):
        if 'title' in match:
            print("Title:", match.title)
        elif 'price' in match:
            print("Price:", match.price)
        elif 'src' in match:
            print("Img src:", match.src)
        else:
            print("???", match.dump())
    

    The first few matches printed are:

    Img src: //media.nintendo.com/nintendo/bin/SF6LoN-xgX1iT617eWfBrNcWH6RQXnSh/I_IRYaBzJ61i-3hnYt_k7hVxHtqGmM_w.png
    Title: Hyrule Warriors: Definitive Edition
    Price: $59.99
    Img src: //media.nintendo.com/nintendo/bin/wcfCyAd7t2N78FkGvEwCOGzVFBNQRbhy/AvG-_d4kEvEplp0mJoUew8IAg71YQveM.png
    Title: Donkey Kong Country: Tropical Freeze
    Price: $59.99
    Img src: //media.nintendo.com/nintendo/bin/QKPpE587ZIA5fUhUL4nSbH3c_PpXYojl/J_Wd79pnFLX1NQISxouLGp636sdewhMS.png
    Title: Wizard of Legend
    Price: $15.99
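
    The scanner reports images, titles and prices as one flat stream, so a little post-processing is needed to get one record per game. The sketch below is one possible way to do that, assuming (as in the output above) that each game's image precedes its title and price. group_games and PAGE_URL are hypothetical names, and the data is copied from the first two games printed above. Note that the src values are protocol-relative (they start with //), which is at least one reason the image pattern in the question, which requires https://, finds nothing. urljoin resolves them against the page URL.

```python
from urllib.parse import urljoin

PAGE_URL = "https://www.nintendo.com/games/nintendo-switch-bestsellers"

# (kind, value) pairs in the order the scanner reported them above
matches = [
    ("src", "//media.nintendo.com/nintendo/bin/SF6LoN-xgX1iT617eWfBrNcWH6RQXnSh/I_IRYaBzJ61i-3hnYt_k7hVxHtqGmM_w.png"),
    ("title", "Hyrule Warriors: Definitive Edition"),
    ("price", "$59.99"),
    ("src", "//media.nintendo.com/nintendo/bin/wcfCyAd7t2N78FkGvEwCOGzVFBNQRbhy/AvG-_d4kEvEplp0mJoUew8IAg71YQveM.png"),
    ("title", "Donkey Kong Country: Tropical Freeze"),
    ("price", "$59.99"),
]


def group_games(pairs):
    """Start a new record at each image and attach the title and price that follow it."""
    games = []
    for kind, value in pairs:
        if kind == "src":
            # Resolve the protocol-relative src against the page URL
            games.append({"image": urljoin(PAGE_URL, value)})
        elif games:
            games[-1][kind] = value
    return games


games = group_games(matches)
for game in games:
    print(game.get("title"), game.get("price"), game["image"])
```

    With the real scanner, the pairs could be built inside the searchString loop instead of being hard-coded.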