python geojson geopandas

Convert string of lat lon to geojson polygon


I have the following dataframe:

col_a   col_b   col_c   lat lon polyline                                                            
0   2.2 3/27/2017 17:45 -34.92967678    -62.34831333    [{lat":-34.92967677667683   lng:-62.34831333160395} {"lat":-34.93002861969753   lng:-62.360866069793644}    {"lat":-34.93526211379422   lng:-62.36063016609785} {"lat":-34.93571078689853   lng:-62.35996507775451} {"lat":-34.935798629937075  lng:-62.34816312789911} {"lat":-34.9333358703344    lng:-62.34824895858759} {"lat":-34.9320340961022    lng:-62.348334789276066}]"      
0   3.3 3/27/2017 17:45 -34.92967678    -62.34831333    [{lat":-34.92967677667683   lng:-62.34831333160395} {"lat":-34.93002861969753   lng:-62.360866069793644}    {"lat":-34.93526211379422   lng:-62.36063016609785} {"lat":-34.93571078689853   lng:-62.35996507775451} {"lat":-34.935798629937075  lng:-62.34816312789911} {"lat":-34.9333358703344    lng:-62.34824895858759} {"lat":-34.9320340961022    lng:-62.348334789276066}]"      

I would like to convert it into a geopandas GeoDataFrame (with the geometry taken from the polyline column), but the polyline column is not in a standard format. How can I fix this?


Solution

  • IIUC, if the original dataframe is a pandas DataFrame, you can use Series.str.translate to remove all double quotes, Series.str.findall to collect every lat/lng pair into a list of tuples, and then build a Polygon from those coordinates (each value is converted from str to float):

    import pandas as pd
    import geopandas as gpd
    from shapely.geometry import Polygon
    
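    # Strip the stray double quotes, then capture every (lat, lng) pair as strings.
    df['coords'] = df.polyline \
        .str.translate(str.maketrans({'"':''})) \
        .str.findall(r'\blat:(-?\d+\.\d+)\s+lng:(-?\d+\.\d+)')
    
    # shapely expects (x, y), i.e. (lng, lat), so swap the matched order.
    geometry = [ Polygon([(float(lng), float(lat)) for lat, lng in e]) for e in df['coords'] ]
    
    # Drop the helper and raw columns; the geometry list becomes the geometry column.
    gdf = gpd.GeoDataFrame(df.drop(['coords','polyline'], axis=1), geometry=geometry)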
    

    Edit: if the methods under pandas.Series.str are not available, you can do the same with Python's re module. For example, assuming the original dataframe is a GeoDataFrame named gdf:

    import re
    ptn = re.compile(r'\blat:(-?\d+\.\d+)\s+lng:(-?\d+\.\d+)')
    # Strip the double quotes from each polyline string, then build (lng, lat) vertices.
    geometry = [
        Polygon([(float(lng), float(lat)) for lat, lng in ptn.findall(e.replace('"', ''))])
        for e in gdf["polyline"]
    ]
    gdf_new = gpd.GeoDataFrame(gdf, geometry=geometry)
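
If the goal is a plain GeoJSON polygon rather than a GeoDataFrame, the same regex idea can be sketched with only the standard library. This is a minimal illustration, not part of the original answer: the helper name polyline_to_geojson is hypothetical, the sample string reuses the first three points of the question's data, and it assumes the pairs are separated by whitespace as shown there. GeoJSON positions are ordered [longitude, latitude] and a ring has to end on its starting point, so the sketch swaps the order and closes the ring.

```python
import json
import re

# Matches one "lat:<number> lng:<number>" pair once double quotes are stripped.
PAIR = re.compile(r'\blat:(-?\d+\.\d+)\s+lng:(-?\d+\.\d+)')

def polyline_to_geojson(polyline):
    """Build a GeoJSON Polygon dict from one raw polyline string (hypothetical helper)."""
    pairs = PAIR.findall(polyline.replace('"', ''))
    # GeoJSON positions are [longitude, latitude], so swap the matched order.
    ring = [[float(lng), float(lat)] for lat, lng in pairs]
    if len(ring) < 3:
        raise ValueError("need at least three points to form a polygon")
    # A GeoJSON linear ring must end on the point it starts from.
    if ring[0] != ring[-1]:
        ring.append(list(ring[0]))
    return {"type": "Polygon", "coordinates": [ring]}

# First three points of the polyline from the question, quotes left as mangled.
sample = ('[{lat":-34.92967677667683   lng:-62.34831333160395} '
          '{"lat":-34.93002861969753   lng:-62.360866069793644} '
          '{"lat":-34.93526211379422   lng:-62.36063016609785}]"')

polygon = polyline_to_geojson(sample)
print(json.dumps(polygon, indent=2))
```

The resulting dict can be written out with json.dump, or wrapped in a Feature if properties such as col_a need to travel with the geometry.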