python regex pyperclip

Why is this regex greedy and why does the example code repeat forever?


I'm going out of my mind trying to figure this out. It's been 3 days now and I'm about ready to give up. The code below should return a list, without repetition, of all phone numbers and emails on the clipboard.

#! python 3
#! Phone number and email address scraper

#take user input for:
#1. webpage to scrape
# - user will be prompted to copy a link
#2. file & location to save to
#3. back to 1 or exit

import pyperclip, re, os.path

#function for locating phone numbers
def phoneNums(clipboard):
    phoneNums = re.compile(r'^(?:\d{8}(?:\d{2}(?:\d{2})?)?|\(\+?\d{2,3}\)\s?(?:\d{4}[\s*.-]?\d{4}|\d{3}[\s*.-]?\d{3}|\d{2}([\s*.-]?)\d{2}\1\d{2}(?:\1\d{2})?))$')
        #(\+\d{1,4})?                   #Optional country code (optional: +, 1-4 digits)
        #(\s)?                          #Optional space
        #(\(\d\))?                      #Optional bracketed area code
        #(\d\d(\s)?\d | \d{3})          #3 digits with optional space between
        #(\s)?                          #Optional space
        #(\d{3})                        #3 digits
        #(\s)?                          #Optional space
        #(\d{4})                        #Last four
        #)
        #)', re.VERBOSE)
    #nos = phoneNums.search(clipboard)  #ignore for now. Failed test of .group()

    return phoneNums.findall(clipboard)

#function for locating email addresses
def emails(clipboard):
    emails = re.compile(r'''(
        [a-z0-9._%+-]*     #username
        @                  #@ sign
        [a-z0-9.-]+        #domain name
        )''', re.I | re.VERBOSE)
    return emails.findall(clipboard)


#function for copying email addresses and numbers from webpage to a file
def scrape(fileName, saveLoc):
    newFile = os.path.join(saveLoc, fileName + ".txt")
    #file = open(newFile, "w+")
    #add phoneNums(currentText) +
    print(currentText)
    print(emails(currentText))
    print(phoneNums(currentText))
    #file.write(emails(currentText))
    #file.close()

url = ''
currentText = ''
file = ''
location =  ''

while True:
    print("Please paste text to scrape. Press ENTER to exit.")
    currentText = str(pyperclip.waitForNewPaste())
    #print("Filename?")
    #file = str(input())
    #print("Where shall I save this? Defaults to C:")
    #location = str(input())
    scrape(file, location)

The emails are returned correctly, but the phone number output from the commented-out regex is as follows:

[('+30 210 458 6600', '+30', ' ', '', '210', '', ' ', '458', ' ', '6600'), ('+30 210 458 6601', '+30', ' ', '', '210', '', ' ', '458', ' ', '6601')]

As you can see, the numbers are being correctly identified, but my regex seems to be greedy, so I tried adding "+?":

def phoneNums(clipboard):
    phoneNums = re.compile(r'''(
        (\+\d{1,4})?                   #Optional country code (optional: +, 1-4 digits)
        (\s)?                          #Optional space
        (\(\d\))?                      #Optional bracketed area code
        (\d\d(\s)?\d | \d{3})          #3 digits with optional space between
        (\s)?                          #Optional space
        (\d{3})                        #3 digits
        (\s)?                          #Optional space
        (\d{4})                        #Last four
        )+?''', re.VERBOSE)

No joy. I tried plugging in a regex example from here: Find phone numbers in python script

Now I know that works because someone else has tested it. What I get is this:

Please paste text to scrape. Press ENTER to exit. 
[] [] 
Please paste text to scrape. Press ENTER to exit. 
[] [('', '', '', '', '', '', '','', '', '')] 
...forever...

That last one isn't even allowing me to copy to the clipboard. .waitForNewPaste() should be doing what it says on the tin, but the moment I run the code the program pulls what's on the clipboard and tries to process it (poorly).

I've obviously got a kink somewhere in my code but I can't see it. Any ideas?


Solution

  • As you pointed out, the regex works.

    The input part '+30 210 458 6600' gets matched one time, and the result is a tuple of all the captured subgroups: ('+30 210 458 6600', '+30', ' ', '', '210', '', ' ', '458', ' ', '6600')

    Note that the first element in the tuple is the entire match.

    If you make all the groups non-capturing by inserting ?: after each opening parenthesis (and drop the outer group), there will be no capturing groups left, so findall returns a list of plain strings and each result will be only the full match, e.g. '+30 210 458 6600' as a str.

        phoneNums = re.compile(r'''
            (?:\+\d{1,4})?                   #Optional country code (optional: +, 1-4 digits)
            (?:\s)?                          #Optional space
            (?:\(\d\))?                      #Optional bracketed area code
            (?:\d\d(?:\s)?\d | \d{3})        #3 digits with optional space between
            (?:\s)?                          #Optional space
            (?:\d{3})                        #3 digits
            (?:\s)?                          #Optional space
            (?:\d{4})                        #Last four
            ''', re.VERBOSE)
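
    To make the difference concrete, here is a minimal sketch comparing findall with and without capturing groups. The patterns are simplified versions of the ones above, and the sample text is an assumption made up for illustration (it reuses the two numbers from the question and repeats one of them). The last line shows one possible way to get the "without repetition" behaviour the question asks for; dict.fromkeys is just one option that keeps the original order.

```python
import re

# Hypothetical sample text; the numbers are taken from the question's output.
text = "Call +30 210 458 6600 or fax +30 210 458 6601. Again: +30 210 458 6600"

# Capturing groups: findall returns one tuple per match, one item per group.
capturing = re.compile(r'''(
    (\+\d{1,4})?   # optional country code
    \s?            # optional space
    (\d{3})        # 3 digits
    \s?            # optional space
    (\d{3})        # 3 digits
    \s?            # optional space
    (\d{4})        # last four
    )''', re.VERBOSE)

# No capturing groups: findall returns the full match as a plain string.
non_capturing = re.compile(r'''
    (?:\+\d{1,4})? # optional country code
    \s?            # optional space
    \d{3}          # 3 digits
    \s?            # optional space
    \d{3}          # 3 digits
    \s?            # optional space
    \d{4}          # last four
    ''', re.VERBOSE)

with_groups = capturing.findall(text)
full_matches = non_capturing.findall(text)

# Drop repeated numbers while keeping first-seen order.
unique = list(dict.fromkeys(full_matches))

print(with_groups[0])  # a tuple: the outer group first, then each inner group
print(full_matches)    # a list of str
print(unique)
```

    The tuples in the question's output are therefore not a sign of greediness; they are simply how findall reports a pattern that contains groups.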
    

    The code 'repeats forever' because the while True: block is an infinite loop with no exit condition. If you want to stop after, say, one iteration, you can put a break statement at the end of the block to stop the loop.

    while True:
        currentText = str(pyperclip.waitForNewPaste())
        scrape(file, location)
        break
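
    If you would rather keep processing pastes until the user asks to stop, a conditional break is another option. The following is only a sketch under stated assumptions: the function name scrape_until_sentinel, the "exit" sentinel, and the fake paste source are all hypothetical and not part of pyperclip or the original code. The paste source is passed in as a parameter so the loop can be tried without a clipboard.

```python
def scrape_until_sentinel(wait_for_paste, handle, sentinel="exit"):
    """Process pastes until one equals the sentinel (hypothetical convention).

    wait_for_paste: callable returning the next pasted text
    handle:         callable invoked with each pasted text
    Returns the number of pastes that were processed.
    """
    rounds = 0
    while True:
        text = str(wait_for_paste())
        if text.strip().lower() == sentinel:
            break  # leave the loop instead of repeating forever
        handle(text)
        rounds += 1
    return rounds


# Stand-in for the clipboard so the sketch runs anywhere (assumption).
fake_pastes = iter(["a@b.com", "+30 210 458 6600", "exit", "never reached"])
seen = []
rounds = scrape_until_sentinel(lambda: next(fake_pastes), seen.append)

print(rounds)
print(seen)
```

    In the real script you would presumably pass pyperclip.waitForNewPaste as wait_for_paste and a function that calls scrape(...) as handle; copying the word "exit" would then end the loop.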