python regex urllib

Python regex: Difference between (.+) and (.+?)


I am new to regex and Python's urllib. I went through an online tutorial on web scraping that included the following code. After studying up on regular expressions, it seemed to me that I could use (.+) instead of (.+?) in my regex, but whoa, was I wrong. I ended up printing far more HTML than I wanted. I thought I was getting the hang of regex, but now I am confused. Please explain the difference between these two expressions and why (.+) grabs so much HTML. Thanks!

P.S. This is a Starbucks stock quote scraper.

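import re
import urllib.request

# Fetch the quote page (in Python 3, urlopen lives in urllib.request)
url = urllib.request.urlopen("http://finance.yahoo.com/q?s=SBUX")
htmltext = url.read().decode("utf-8", errors="replace")  # bytes -> str

# Lazy group: capture only up to the first closing </span>
regex = re.compile(r'<span id="yfs_l84_sbux">(.+?)</span>')
found = regex.findall(htmltext)

print(found)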


Solution

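  • .+ is greedy -- it consumes as many characters as it can, then backtracks ("gives back" characters) only as far as needed for the rest of the pattern to match.

    .+? is lazy (non-greedy) -- it consumes as few characters as possible and stops at the first point where the rest of the pattern can match.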

    Examples:

    Assume you have this HTML:

    <span id="yfs_l84_sbux">foo bar</span><span id="yfs_l84_sbux2">foo bar</span>
    

    This regex matches the whole thing:

    <span id="yfs_l84_sbux">(.+)<\/span>
    

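    The .+ first runs all the way to the end of the line, then "gives back" characters just far enough for the rest of the regex to match the last </span>. The complete regex therefore matches the entire HTML chunk, and the group captures everything between the first opening tag and that last </span>. Because . does not match a newline by default, on a real page the greedy version grabs everything up to the last </span> on the same line, which is why so much HTML gets printed.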

    But this regex stops at the first </span>:

    <span id="yfs_l84_sbux">(.+?)<\/span>
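To see the difference directly, here is a minimal sketch that runs both patterns against the sample HTML from this answer. It uses Python 3 syntax, and the variable names (html, greedy, lazy, greedy_found, lazy_found) are only illustrative, not part of the original code.

```python
import re

# Sample HTML from the answer: two adjacent spans on a single line.
html = ('<span id="yfs_l84_sbux">foo bar</span>'
        '<span id="yfs_l84_sbux2">foo bar</span>')

# Greedy: .+ runs to the end of the line, then backtracks to the LAST </span>.
greedy = re.compile(r'<span id="yfs_l84_sbux">(.+)</span>')

# Lazy: .+? expands one character at a time and stops at the FIRST </span>.
lazy = re.compile(r'<span id="yfs_l84_sbux">(.+?)</span>')

greedy_found = greedy.findall(html)
lazy_found = lazy.findall(html)

print(greedy_found)  # ['foo bar</span><span id="yfs_l84_sbux2">foo bar']
print(lazy_found)    # ['foo bar']
```

The greedy group swallows the first closing tag and the whole second opening tag, while the lazy group captures only the text of the first span.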