Search code examples
pandasdataframeseries

split a string from a list in a Pandas series


I have a series of geographic coordinates as below and am fairly new to Series. I need to extract the latitude and longitude from these coordinates into separate series to concatenate back into the dataframe. I have tried a few different methods w/o luck. Can someone please help me understand how best to do this?

Code I have tried:

printing type of this is a <list> but just returns the first record in the series

df['Location'].str.split(',')[0]

treat as string and try to extract the first list element and then the first of the tuple

longitude = df['Location'].str[0][0].replace('(', '')

The Location field is a series of lists of tuples.

0         [(34.0255,  -118.3002)]
1         [(34.0256,  -118.3089)]
2         [(34.0738,  -118.2078)]
3         [(34.0492,  -118.2391)]
4         [(34.0108,  -118.3182)]
                   ...
594800    [(34.0436,  -118.3053)]
594801    [(33.9572,  -118.3962)]
594802    [(34.1684,  -118.3982)]
594803    [(34.1976,  -118.3965)]
594804    [(34.0504,  -118.3442)]

Solution

  • Based on the data from the comment, Location is actually a string, containing a tuple. And there's a None among the values.

    So we need to parse the coordinates and be careful of Nones.

    I have tried two approaches

    1. Using ast.literal_eval to parse a tuple from a string. Convert a String representation of a Dictionary to a dictionary
    import ast
    
    def get_coordinate_ast(location, coord_idx):
        try:
            return ast.literal_eval(location)[coord_idx]
        except Exception as e:
            print(e)
            return
    
    %%time
    # getting lat
    df["lat_ast"] = df["Location"].apply(lambda l: get_coordinate_ast(l, 0))
    # getting long
    df["lon_ast"] = df["Location"].apply(lambda l: get_coordinate_ast(l, 1))
    
    print(f"Not able to parse lat: {df['lat_ast'].isna().sum()}")
    print(f"Not able to parse lon: {df['lon_ast'].isna().sum()}")
    

    Output:

    malformed node or string: nan
    malformed node or string: nan
    Not able to parse lat: 1
    Not able to parse lon: 1
    CPU times: user 6.41 s, sys: 57.5 ms, total: 6.47 s
    Wall time: 6.48 s
    
    1. Using regexp:
    %%time
    # getting lat
    df["lat_re"] = df["Location"].apply(lambda l: get_coordinate_re(l, 0))
    # getting long
    df["lon_re"] = df["Location"].apply(lambda l: get_coordinate_re(l, 1))
    
    print(f"Not able to parse lat: {df['lat_re'].isna().sum()}")
    print(f"Not able to parse lon: {df['lon_re'].isna().sum()}")
    

    Output:

    expected string or bytes-like object
    expected string or bytes-like object
    Not able to parse lat: 1
    Not able to parse lon: 1
    CPU times: user 941 ms, sys: 16.8 ms, total: 958 ms
    Wall time: 960 ms
    

    Both methods produce only 1 unparsed value (last row). And give equal results

    print((df["lat_ast"] != df["lat_re"]).sum())
    print((df["lon_ast"] != df["lon_re"]).sum())
    

    Output:

    1
    1
    

    (These are both Nones, that didn't match)

    However, re method is much more quicker: 960 ms vs 6.48 s.