I have a series of geographic coordinates as below and am fairly new to Series. I need to extract the latitude and longitude from these coordinates into separate series to concatenate back into the dataframe. I have tried a few different methods w/o luck. Can someone please help me understand how best to do this?
Code I have tried:
<list>
but just returns the first record in the seriesdf['Location'].str.split(',')[0]
longitude = df['Location'].str[0][0].replace('(', '')
The Location field is a series of lists of tuples.
0 [(34.0255, -118.3002)]
1 [(34.0256, -118.3089)]
2 [(34.0738, -118.2078)]
3 [(34.0492, -118.2391)]
4 [(34.0108, -118.3182)]
...
594800 [(34.0436, -118.3053)]
594801 [(33.9572, -118.3962)]
594802 [(34.1684, -118.3982)]
594803 [(34.1976, -118.3965)]
594804 [(34.0504, -118.3442)]
Based on the data from the comment, Location
is actually a string, containing a tuple. And there's a None among the values.
So we need to parse the coordinates and be careful of Nones.
I have tried two approaches
ast.literal_eval
to parse a tuple from a string. Convert a String representation of a Dictionary to a dictionaryimport ast
def get_coordinate_ast(location, coord_idx):
try:
return ast.literal_eval(location)[coord_idx]
except Exception as e:
print(e)
return
%%time
# getting lat
df["lat_ast"] = df["Location"].apply(lambda l: get_coordinate_ast(l, 0))
# getting long
df["lon_ast"] = df["Location"].apply(lambda l: get_coordinate_ast(l, 1))
print(f"Not able to parse lat: {df['lat_ast'].isna().sum()}")
print(f"Not able to parse lon: {df['lon_ast'].isna().sum()}")
Output:
malformed node or string: nan
malformed node or string: nan
Not able to parse lat: 1
Not able to parse lon: 1
CPU times: user 6.41 s, sys: 57.5 ms, total: 6.47 s
Wall time: 6.48 s
%%time
# getting lat
df["lat_re"] = df["Location"].apply(lambda l: get_coordinate_re(l, 0))
# getting long
df["lon_re"] = df["Location"].apply(lambda l: get_coordinate_re(l, 1))
print(f"Not able to parse lat: {df['lat_re'].isna().sum()}")
print(f"Not able to parse lon: {df['lon_re'].isna().sum()}")
Output:
expected string or bytes-like object
expected string or bytes-like object
Not able to parse lat: 1
Not able to parse lon: 1
CPU times: user 941 ms, sys: 16.8 ms, total: 958 ms
Wall time: 960 ms
Both methods produce only 1 unparsed value (last row). And give equal results
print((df["lat_ast"] != df["lat_re"]).sum())
print((df["lon_ast"] != df["lon_re"]).sum())
Output:
1
1
(These are both Nones, that didn't match)
However, re
method is much more quicker: 960 ms
vs 6.48 s
.