I am working on a Python program that searches through received emails and returns coordinates. I am trying to create a regular expression to select the Lat/long values from a string. (I am new to regex)
Here is a small example of one of the strings I have been using for testing:
content = """
WorkLocationBoundingBox
Latitude:30.556555Longitude:-97.659824
SecondLatitude:30.569138SecondLongitude:-97.650855
"""
I came up with Latitude:(\d+).(\d+)Longitude:(.*), which I believe is close to what I need, but it separates 30 and 556555 into separate groups, whereas -97.659824 is correctly placed into a single group.
My ideal expected result would look something like this:
[(30.556555, -97.659824, 30.569138, -97.650855)]
You can use 3 capture groups, where the first group captures the optional word before Latitude and a backreference to it requires the same word before Longitude.
((?:Second)?)Latitude:(-?\d+(?:\.\d+)?)\1Longitude:(-?\d+(?:\.\d+)?)
((?:Second)?)
Capture group 1, optionally match Second
Latitude:
Match literally
(-?\d+(?:\.\d+)?)
Capture group 2, match an optional - then 1+ digits with an optional decimal part
\1Longitude:
A backreference to what is matched in group 1, then match Longitude: literally
(-?\d+(?:\.\d+)?)
Capture group 3, match an optional - then 1+ digits with an optional decimal part
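The same breakdown can be kept next to the pattern itself by compiling it with re.VERBOSE, which ignores whitespace in the pattern and allows # comments. This is only a sketch of an alternative layout; the names pattern, s and rows are illustrative.

```python
import re

# Same pattern as above, spread over several lines with inline comments.
pattern = re.compile(r"""
    ((?:Second)?)        # group 1: optional prefix "Second"
    Latitude:            # literal text
    (-?\d+(?:\.\d+)?)    # group 2: optional -, digits, optional decimal part
    \1Longitude:         # same prefix as group 1, then literal "Longitude:"
    (-?\d+(?:\.\d+)?)    # group 3: optional -, digits, optional decimal part
    """, re.VERBOSE)

s = ("WorkLocationBoundingBox\n"
     "Latitude:30.556555Longitude:-97.659824\n"
     "SecondLatitude:30.569138SecondLongitude:-97.650855")

# With three capture groups, findall returns one 3-tuple per match;
# the first item is the (possibly empty) prefix from group 1.
rows = pattern.findall(s)
print(rows)
```

Because group 1 is part of every tuple, the latitude and longitude are at index 1 and 2 of each row.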
import re

regex = r"((?:Second)?)Latitude:(-?\d+(?:\.\d+)?)\1Longitude:(-?\d+(?:\.\d+)?)"
s = ("WorkLocationBoundingBox\n"
     "Latitude:30.556555Longitude:-97.659824\n"
     "SecondLatitude:30.569138SecondLongitude:-97.650855")

lst = []
# Group 1 only holds the optional prefix; groups 2 and 3 hold the coordinates.
for match in re.finditer(regex, s):
    lst.append(match.group(2))
    lst.append(match.group(3))
print(lst)
Output
['30.556555', '-97.659824', '30.569138', '-97.650855']
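The expected result in the question holds floats in a single tuple rather than a list of strings. A minimal sketch of that conversion, reusing the same pattern and assuming every matched value is a plain decimal that float() accepts (the names coords and result are illustrative):

```python
import re

regex = r"((?:Second)?)Latitude:(-?\d+(?:\.\d+)?)\1Longitude:(-?\d+(?:\.\d+)?)"
s = ("WorkLocationBoundingBox\n"
     "Latitude:30.556555Longitude:-97.659824\n"
     "SecondLatitude:30.569138SecondLongitude:-97.650855")

coords = []
for match in re.finditer(regex, s):
    # Convert the latitude (group 2) and longitude (group 3) to floats.
    coords.extend(float(match.group(i)) for i in (2, 3))

# Wrap all values in one tuple inside a list, as in the question.
result = [tuple(coords)]
print(result)  # [(30.556555, -97.659824, 30.569138, -97.650855)]
```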
A slightly less strict pattern matches optional word characters before both Latitude and Longitude:
\w*Latitude:(-?\d+(?:\.\d+)?)\w*Longitude:(-?\d+(?:\.\d+)?)
In that case, you might also use re.findall to return the group values in a list of tuples if you want:
import re
pattern = r"\w*Latitude:(-?\d+(?:\.\d+)?)\w*Longitude:(-?\d+(?:\.\d+)?)"
s = ("WorkLocationBoundingBox\n"
"Latitude:30.556555Longitude:-97.659824\n"
"SecondLatitude:30.569138SecondLongitude:-97.650855")
print(re.findall(pattern, s))
Output
[('30.556555', '-97.659824'), ('30.569138', '-97.650855')]
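To see what "less strict" means in practice, here is a small comparison of the two patterns on a line where the prefixes do not agree. The input string mixed is a made-up example, not taken from the question's data.

```python
import re

strict = re.compile(
    r"((?:Second)?)Latitude:(-?\d+(?:\.\d+)?)\1Longitude:(-?\d+(?:\.\d+)?)")
loose = re.compile(
    r"\w*Latitude:(-?\d+(?:\.\d+)?)\w*Longitude:(-?\d+(?:\.\d+)?)")

# Hypothetical malformed line: the prefix appears only before Longitude.
mixed = "Latitude:30.556555SecondLongitude:-97.650855"

# The backreference \1 demands the same prefix on both sides, so no match.
strict_hit = strict.search(mixed)
print(strict_hit)  # None

# \w* accepts any word characters on either side, so this still matches.
loose_hit = loose.search(mixed)
print(loose_hit.groups())  # ('30.556555', '-97.650855')
```

Which behaviour is preferable depends on whether mismatched prefixes should be treated as valid pairs or skipped.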