palantir-foundry, foundry-code-repositories, geospark

Best approach for geospatial indexes in Palantir Foundry


What is the recommended approach for building a pipeline that needs to find which polygon (shape) contains a given point in Palantir Foundry? In the past, this has been pretty difficult in Spark. GeoSpark has been fairly popular, but it can still lag. If there is nothing specific to Foundry, I can implement something with GeoSpark. I have ~13k shapes and batches of thousands of points.


Solution

  • How large are the datasets? With a big enough driver and some optimizations, I previously got this working with geopandas. Just make sure that the coordinate points are in the same projection (CRS) as the polygons.
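
    If the two sides are in different coordinate reference systems, geopandas can reproject one of them before the join. Below is a minimal sketch, assuming the points are WGS84 (EPSG:4326) GeoJSON and the polygons arrive in Web Mercator (EPSG:3857); both CRS choices and the gdf_polygons name are illustrative, not from the pipeline below:

    import geopandas

    # Illustrative GeoDataFrame of polygons; tag it with its source CRS (assumed EPSG:3857 here)
    # and reproject it to match the WGS84 points before the spatial join.
    gdf_polygons = gdf_polygons.set_crs(epsg=3857, allow_override=True)
    gdf_polygons = gdf_polygons.to_crs(epsg=4326)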

    Here is a helper function:

    from shapely import geometry
    import json
    import geopandas
    from pyspark.sql import functions as F
    
    
    
    def geopandas_spatial_join(df_left, df_right, geometry_left, geometry_right, how='inner', op='intersects'):
        '''
        Computes a spatial join of two Geopandas dataframes. Implements the Geopandas "sjoin" method, reference: https://geopandas.org/reference/geopandas.sjoin.html.
        Expects both dataframes to contain a GeoJSON geometry column, whose names are passed as the 'geometry_left' and 'geometry_right' arguments.
    
        Inputs:
            df_left (PANDAS_DATAFRAME): Left input dataframe.
            df_right (PANDAS_DATAFRAME): Right input dataframe.
            geometry_left (string): Name of the geometry column of the left dataframe.
            geometry_right (string): Name of the geometry column of the right dataframe.
            how (string): The type of join, one of {'left', 'right', 'inner'}.
            op (string): Binary predicate, one of {'intersects', 'contains', 'within'}.
    
        Outputs:
            (PANDAS_DATAFRAME): Joined dataframe.
        '''
    
        df1 = df_left.copy()  # copy so the caller's dataframe is not mutated
        df1["geometry_left_shape"] = df1[geometry_left].apply(json.loads)
        df1["geometry_left_shape"] = df1["geometry_left_shape"].apply(geometry.shape)
        gdf_left = geopandas.GeoDataFrame(df1, geometry="geometry_left_shape")
    
        df2 = df_right.copy()  # copy so the caller's dataframe is not mutated
        df2["geometry_right_shape"] = df2[geometry_right].apply(json.loads)
        df2["geometry_right_shape"] = df2["geometry_right_shape"].apply(geometry.shape)
        gdf_right = geopandas.GeoDataFrame(df2, geometry="geometry_right_shape")
    
        # Note: geopandas >= 0.10 renamed the 'op' keyword to 'predicate'
        joined = geopandas.sjoin(gdf_left, gdf_right, how=how, op=op)
        joined = joined.drop(joined.filter(items=["geometry_left_shape", "geometry_right_shape"]).columns, axis=1)
    
        return joined
    

    We can then run the join:

    import pandas as pd

    left_df = points_df.toPandas()
    left_geo_column = "point_geometry"

    right_df = polygon_df.toPandas()
    right_geo_column = "polygon_geometry"

    pdf = geopandas_spatial_join(left_df, right_df, left_geo_column, right_geo_column)

    return_df = spark.createDataFrame(pdf).dropDuplicates()

    return return_df
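
    The snippet above assumes it runs inside a Foundry Python transform, where points_df and polygon_df are the transform's inputs and spark is its Spark session. A rough sketch of that wrapper, using the transforms.api decorator with placeholder dataset paths (the paths and the compute-function name are assumptions, not from the original answer):

    from transforms.api import transform_df, Input, Output


    @transform_df(
        Output("/Project/folder/points_with_polygons"),  # placeholder output dataset path
        points_df=Input("/Project/folder/points"),       # placeholder input dataset path
        polygon_df=Input("/Project/folder/polygons"),    # placeholder input dataset path
    )
    def compute(ctx, points_df, polygon_df):
        spark = ctx.spark_session

        left_df = points_df.toPandas()
        right_df = polygon_df.toPandas()

        # geopandas_spatial_join is the helper function defined earlier
        pdf = geopandas_spatial_join(left_df, right_df, "point_geometry", "polygon_geometry")

        return spark.createDataFrame(pdf).dropDuplicates()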