split a string from a list in a Pandas series

I have a series of geographic coordinates as below and am fairly new to Series. I need to extract the latitude and longitude from these coordinates into separate series to concatenate back into the dataframe. I have tried a few different methods w/o luck. Can someone please help me understand how best to do this?

Code I have tried:

printing type of this is a `<list>` but just returns the first record in the series

df['Location'].str.split(',')[0]

treat as string and try to extract the first list element and then the first of the tuple

longitude = df['Location'].str[0][0].replace('(', '')

The Location field is a series of lists of tuples.

0         [(34.0255,  -118.3002)]
1         [(34.0256,  -118.3089)]
2         [(34.0738,  -118.2078)]
3         [(34.0492,  -118.2391)]
4         [(34.0108,  -118.3182)]
                   ...
594800    [(34.0436,  -118.3053)]
594801    [(33.9572,  -118.3962)]
594802    [(34.1684,  -118.3982)]
594803    [(34.1976,  -118.3965)]
594804    [(34.0504,  -118.3442)]

Solution

Based on the data from the comment, Location is actually a string, containing a tuple. And there's a None among the values.

So we need to parse the coordinates and be careful of Nones.

I have tried two approaches

Using ast.literal_eval to parse a tuple from a string. Convert a String representation of a Dictionary to a dictionary

import ast

def get_coordinate_ast(location, coord_idx):
    try:
        return ast.literal_eval(location)[coord_idx]
    except Exception as e:
        print(e)
        return

%%time
# getting lat
df["lat_ast"] = df["Location"].apply(lambda l: get_coordinate_ast(l, 0))
# getting long
df["lon_ast"] = df["Location"].apply(lambda l: get_coordinate_ast(l, 1))

print(f"Not able to parse lat: {df['lat_ast'].isna().sum()}")
print(f"Not able to parse lon: {df['lon_ast'].isna().sum()}")

Output:

malformed node or string: nan
malformed node or string: nan
Not able to parse lat: 1
Not able to parse lon: 1
CPU times: user 6.41 s, sys: 57.5 ms, total: 6.47 s
Wall time: 6.48 s

Using regexp:

%%time
# getting lat
df["lat_re"] = df["Location"].apply(lambda l: get_coordinate_re(l, 0))
# getting long
df["lon_re"] = df["Location"].apply(lambda l: get_coordinate_re(l, 1))

print(f"Not able to parse lat: {df['lat_re'].isna().sum()}")
print(f"Not able to parse lon: {df['lon_re'].isna().sum()}")

Output:

expected string or bytes-like object
expected string or bytes-like object
Not able to parse lat: 1
Not able to parse lon: 1
CPU times: user 941 ms, sys: 16.8 ms, total: 958 ms
Wall time: 960 ms

Both methods produce only 1 unparsed value (last row). And give equal results

print((df["lat_ast"] != df["lat_re"]).sum())
print((df["lon_ast"] != df["lon_re"]).sum())

Output:

1
1

(These are both Nones, that didn't match)

However, re method is much more quicker: 960 ms vs 6.48 s.

split a string from a list in a Pandas series

printing type of this is a <list> but just returns the first record in the series

treat as string and try to extract the first list element and then the first of the tuple

printing type of this is a `<list>` but just returns the first record in the series