I'm going out of my mind trying to figure this out. It's been 3 days now and I'm about ready to give up. The code below should return a list, without repetition, of all phone numbers and emails on the clipboard.
#! python3
#! Phone number and email address scraper
#take user input for:
#1. webpage to scrape
# - user will be prompted to copy a link
#2. file & location to save to
#3. back to 1 or exit
import pyperclip, re, os.path
#function for locating phone numbers
def phoneNums(clipboard):
    phoneNums = re.compile(r'^(?:\d{8}(?:\d{2}(?:\d{2})?)?|\(\+?\d{2,3}\)\s?(?:\d{4}[\s*.-]?\d{4}|\d{3}[\s*.-]?\d{3}|\d{2}([\s*.-]?)\d{2}\1\d{2}(?:\1\d{2})?))$')
    #(\+\d{1,4})? #Optional country code (optional: +, 1-4 digits)
    #(\s)? #Optional space
    #(\(\d\))? #Optional bracketed area code
    #(\d\d(\s)?\d | \d{3}) #3 digits with optional space between
    #(\s)? #Optional space
    #(\d{3}) #3 digits
    #(\s)? #Optional space
    #(\d{4}) #Last four
    #)
    #)', re.VERBOSE)
    #nos = phoneNums.search(clipboard) #ignore for now. Failed test of .group()
    return phoneNums.findall(clipboard)
#function for locating email addresses
def emails(clipboard):
    emails = re.compile(r'''(
        [a-z0-9._%+-]* #username
        @ #@ sign
        [a-z0-9.-]+ #domain name
        )''', re.I | re.VERBOSE)
    return emails.findall(clipboard)
#function for copying email addresses and numbers from webpage to a file
def scrape(fileName, saveLoc):
    newFile = os.path.join(saveLoc, fileName + ".txt")
    #file = open(newFile, "w+")
    #add phoneNums(currentText) +
    print(currentText)
    print(emails(currentText))
    print(phoneNums(currentText))
    #file.write(emails(currentText))
    #file.close()
url = ''
currentText = ''
file = ''
location = ''
while True:
    print("Please paste text to scrape. Press ENTER to exit.")
    currentText = str(pyperclip.waitForNewPaste())
    #print("Filename?")
    #file = str(input())
    #print("Where shall I save this? Defaults to C:")
    #location = str(input())
    scrape(file, location)
The emails are returned correctly, but the phone number output from the commented-out regex is as follows:
[('+30 210 458 6600', '+30', ' ', '', '210', '', ' ', '458', ' ', '6600'), ('+30 210 458 6601', '+30', ' ', '', '210', '', ' ', '458', ' ', '6601')]
As you can see, the numbers are being correctly identified, but my regex seems to be greedy, so I tried adding "+?":
def phoneNums(clipboard):
    phoneNums = re.compile(r'''(
        (\+\d{1,4})? #Optional country code (optional: +, 1-4 digits)
        (\s)? #Optional space
        (\(\d\))? #Optional bracketed area code
        (\d\d(\s)?\d | \d{3}) #3 digits with optional space between
        (\s)? #Optional space
        (\d{3}) #3 digits
        (\s)? #Optional space
        (\d{4}) #Last four
        )+?''', re.VERBOSE)
No joy. I tried plugging in a regex example from here: Find phone numbers in python script
Now I know that works because someone else has tested it. What I get is this:
Please paste text to scrape. Press ENTER to exit.
[] []
Please paste text to scrape. Press ENTER to exit.
[] [('', '', '', '', '', '', '','', '', '')]
...forever...
That last one isn't even allowing me to copy to the clipboard. .waitForNewPaste() should be doing what it says on the tin, but the moment I run the code the program pulls what's on the clipboard and tries to process it (poorly).
I've obviously got a kink somewhere in my code but I can't see it. Any ideas?
As you pointed out, the regex works.
The input part '+30 210 458 6600' gets matched one time, and the result is a tuple of all the captured subgroups: ('+30 210 458 6600', '+30', ' ', '', '210', '', ' ', '458', ' ', '6600')
Note that the first element of the tuple is the entire match, because the whole pattern is wrapped in an outer capturing group.
If you drop that outer group and make all the remaining groups non-capturing by inserting ?: after each opening parenthesis, there will be no capturing groups left and the result will be only the full match '+30 210 458 6600' as a str.
phoneNums = re.compile(r'''
    (?:\+\d{1,4})? #Optional country code (optional: +, 1-4 digits)
    (?:\s)? #Optional space
    (?:\(\d\))? #Optional bracketed area code
    (?:\d\d(?:\s)?\d | \d{3}) #3 digits with optional space between
    (?:\s)? #Optional space
    (?:\d{3}) #3 digits
    (?:\s)? #Optional space
    (?:\d{4}) #Last four
    ''', re.VERBOSE)
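To make the difference concrete, here is a minimal sketch comparing `findall` on a capturing and a non-capturing pattern. The patterns are simplified stand-ins for the regex above (no bracketed area code), and the sample text is made up for illustration. Since the question asks for results without repetition, the last line also shows one possible way to drop duplicates while keeping order; that step is my own addition, not part of the original code.

```python
import re

# Made-up sample text; the first number appears twice on purpose.
text = "Call +30 210 458 6600 or +30 210 458 6601. Again: +30 210 458 6600."

# Capturing groups: findall returns one tuple per match, one entry per group.
capturing = re.compile(r"((\+\d{1,4})?\s?(\d{3})\s?(\d{3})\s?(\d{4}))")

# Non-capturing groups only: findall returns each full match as a plain str.
non_capturing = re.compile(r"(?:\+\d{1,4})?\s?(?:\d{3})\s?(?:\d{3})\s?(?:\d{4})")

with_groups = capturing.findall(text)
without_groups = non_capturing.findall(text)

# One way to remove repeats while preserving first-seen order.
unique_numbers = list(dict.fromkeys(without_groups))

print(with_groups[0])   # ('+30 210 458 6600', '+30', '210', '458', '6600')
print(unique_numbers)   # ['+30 210 458 6600', '+30 210 458 6601']
```

If you do want the individual parts (country code, last four digits, and so on), keeping the capturing groups and picking the first element of each tuple works just as well.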
The code 'repeats forever' because the while True: block is an infinite loop. If you want to stop after, let's say, one iteration, you can put a break statement at the end of the block to stop the loop.
while True:
    currentText = str(pyperclip.waitForNewPaste())
    scrape(file, location)
    break
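If you would rather keep looping until the user signals they are done, a conditional break is another option. The following is only a sketch: `run_until_blank`, `get_text` and `handle` are hypothetical names, and `get_text` stands in for whatever supplies the text (for example `input`, or a wrapper around the clipboard call). A fake source is used here so the example runs without a clipboard.

```python
def run_until_blank(get_text, handle):
    """Call handle(text) for each text from get_text() until a blank one arrives.

    Both parameters are hypothetical stand-ins: get_text supplies the next
    piece of text, handle processes it (e.g. a scrape function).
    Returns the number of texts that were processed.
    """
    processed = 0
    while True:
        text = get_text()
        if not text.strip():   # blank entry: leave the loop
            break
        handle(text)
        processed += 1
    return processed


# Fake text source so the sketch runs without a clipboard.
pastes = iter(["+30 210 458 6600", "someone@example.com", ""])
seen = []
count = run_until_blank(lambda: next(pastes), seen.append)
print(count, seen)   # 2 ['+30 210 458 6600', 'someone@example.com']
```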