Tags: python, python-polars, shapely, wkt, multipolygons

Create MultiPolygon objects from struct type and list[list[list[f64]]] columns using Polars


I have downloaded the NYC Taxi Zones dataset (downloaded from the SODA API and saved as a JSON file, not GeoJSON or a Shapefile). The dataset is rather small, so I am using all of the information it contains. For convenience I am presenting the first 2 rows of the dataset:

  • the original (with the struct-type value of the_geom).
  • the dataset after unpacking the struct type with the unnest() command in Polars (updated with the write_ndjson() command).

The original dataset: [screenshot]

Below is the dataset after applying unnest() and selecting some columns, shown for the first 2 rows: [screenshot]

You can import the data using Polars with the following command:

import polars as pl

poc = pl.read_json("./data.json")

I am interested in the multipolygons. I am actually trying to re-calculate shape_area from the multipolygon via its WKT (Well-Known Text) representation, using the shapely module.

What I have done so far is to take the coordinates column and transform it into a MultiPolygon() object readable by shapely.

def flatten_list_of_lists(data):
    # Strip two nesting levels: polygon -> ring -> point
    return [subitem3 for subitem1 in data for subitem2 in subitem1 for subitem3 in subitem2]

The function takes a list[list[list[list[f64]]]] object as input and transforms it into a list[list[f64]] object.
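As a toy illustration (with made-up coordinates, not the NYC data), the three levels of the comprehension strip away the polygon and ring nesting and leave a flat list of [lon, lat] pairs. The function is repeated here so the snippet runs standalone:

```python
def flatten_list_of_lists(data):
    # Strip two nesting levels: polygon -> ring -> point
    return [subitem3 for subitem1 in data for subitem2 in subitem1 for subitem3 in subitem2]

# One "multipolygon" with two polygons, each holding one ring of two points
nested = [
    [[[-74.0, 40.0], [-74.1, 40.1]]],
    [[[-73.9, 40.2], [-73.8, 40.3]]],
]
print(flatten_list_of_lists(nested))
# [[-74.0, 40.0], [-74.1, 40.1], [-73.9, 40.2], [-73.8, 40.3]]
```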

flattened_lists = [flatten_list_of_lists(row) for row in poc["coordinates"].to_list()]

print(flattened_lists)

[[[-74.18445299999996, 40.694995999999904], [-74.18448899999999, 40.69509499999987], [-74.18449799999996, 40.69518499999987], [-74.18438099999997, 40.69587799999989], [-74.18428199999994, 40.6962109999999], [-74.18402099999997, 40.69707499999989]...

Then I use the function below, which applies string concatenation and basically:

  • Transforms the list[list[f64]] object to a string.
  • Adds the keyword MULTIPOLYGON in front of the string.
  • Replaces '[' and ']' with '(' and ')' respectively.

def polygon_to_wkt(polygon):
    # Convert each coordinate pair to a string formatted as "lon lat"
    coordinates_str = ", ".join(f"{lon} {lat}" for lon, lat in polygon)
    # Format into a WKT Multipolygon string (each polygon as a single polygon multipolygon)
    return f"""MULTIPOLYGON ((({coordinates_str})))"""

formatted_wkt = [polygon_to_wkt(polygon) for polygon in flattened_lists]
poc = poc.with_columns(pl.Series("WKT", formatted_wkt))
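To make the string format concrete, here is a toy run of polygon_to_wkt on a hypothetical closed square (not the NYC data):

```python
def polygon_to_wkt(polygon):
    # Convert each coordinate pair to a "lon lat" string and wrap it
    # in a single-polygon MULTIPOLYGON
    coordinates_str = ", ".join(f"{lon} {lat}" for lon, lat in polygon)
    return f"MULTIPOLYGON ((({coordinates_str})))"

square = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.0, 0.0]]
print(polygon_to_wkt(square))
# MULTIPOLYGON (((0.0 0.0, 1.0 0.0, 1.0 1.0, 0.0 1.0, 0.0 0.0)))
```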


Finally, I use the method wkt.loads("MULTIPOLYGON ((()))").area to compute the area of the MultiPolygon object:

from shapely import wkt

def convert_to_shapely_area(wkt_str):
    try:
        return wkt.loads(wkt_str).area
    except Exception as e:
        print("Error converting to WKT:", e)
        return None

poc = poc.with_columns(
    pl.col("WKT").map_elements(convert_to_shapely_area).alias("shapely_geometry")
)
print(poc.head())
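For a simple, hole-free ring in planar coordinates, shapely's .area is the shoelace formula. A pure-Python sketch (toy data) that can be used to sanity-check the values without shapely:

```python
def shoelace_area(ring):
    # Shoelace formula; ring is a list of [x, y] pairs.
    # Works whether or not the last point repeats the first
    # (the duplicate edge contributes zero).
    n = len(ring)
    s = 0.0
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0

unit_square = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
print(shoelace_area(unit_square))  # 1.0
```

Note that for lon/lat coordinates this is an area in "square degrees", matching what shapely reports for unprojected geometries.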


For the first shape the WKT correctly returns the area of the object, while for the second MultiPolygon the method returns the following error:

IllegalArgumentException: Points of LinearRing do not form a closed linestring

What I have noticed between the two rows is that the multipolygon of Newark Airport is one contiguous list[list[f64]] of coordinates, whereas Jamaica Bay has multiple list[list[f64]] sublists (check the coordinates column to verify this). The screenshot below also confirms this. [screenshot]

Thus, is there any way to unify the multipolygons of Jamaica Bay into one single geometric object before applying WKT?
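One way to see why the flattened Jamaica Bay row fails: a WKT LinearRing must end at the point where it starts, and concatenating the points of several closed rings into one sequence produces a ring whose first and last points differ. A quick pure-Python check with hypothetical toy rings:

```python
def is_closed(ring):
    # A valid LinearRing must start and end at the same point
    return ring[0] == ring[-1]

ring_a = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
ring_b = [[2.0, 2.0], [3.0, 2.0], [3.0, 3.0], [2.0, 2.0]]

print(is_closed(ring_a))           # True: a proper closed ring
print(is_closed(ring_a + ring_b))  # False: flattening merged two rings into one open one
```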

P.S.: Many solutions on GitHub use the shapefile, but I would like to re-download the NYC zones automatically and regularly with the SODA API.

To download the raw .json file from the SODA API (omitting the logger_object setup; let's pretend it's print()):

import sys
import json
import requests

params = {
    "$limit": geospatial_batch_size,  # i.e. 50_000
    "$$app_token": config.get("api-settings", "app_token")
}
response = requests.get(api_url, params=params)
if response.status_code == 200:
    data = response.json()
    if not data:
        logger_object.info("No data found, please check the API connection.")
        sys.exit()
    with open("{0}/nyc_zone_districts_data.json".format(geospatial_data_storage), "w") as f:
        json.dump(data, f, indent=4)
else:
    logger_object.error("API request failed.")
    logger_object.error("Error: {0}".format(response.status_code))
    logger_object.error(response.text)

Solution

  • I eventually found a solution to this problem. As already described, I am flattening the list of coordinates (a.k.a. points). A point in geospatial data is an (x, y) coordinate, and a MultiPolygon is built from multiple polygons, each a sequence of points.

    def flatten_list_of_lists(data) -> list:
        # Drop only the outermost (polygon) level; keep each ring intact
        return [subitem2 for subitem1 in data for subitem2 in subitem1]
    

    The function above is important because the input argument data is of type list[list[list[list[f64]]]], and a MultiPolygon expects a specific nesting level per point.
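    In contrast to the earlier three-level flatten, dropping only the outermost level keeps each ring intact, so every inner list can still become its own Polygon. A toy run (the function is repeated so the snippet is self-contained, with hypothetical coordinates):

```python
def flatten_list_of_lists(data) -> list:
    # Drop only the outermost (polygon) level; keep each ring intact
    return [subitem2 for subitem1 in data for subitem2 in subitem1]

nested = [
    [[[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]],  # polygon 1, one closed ring
    [[[2.0, 2.0], [3.0, 2.0], [3.0, 3.0], [2.0, 2.0]]],  # polygon 2, one closed ring
]
rings = flatten_list_of_lists(nested)
print(len(rings))                   # 2 rings survive as separate lists
print(rings[0][0] == rings[0][-1])  # True: each ring is still closed
```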

    Then I transform the flattened list into a MultiPolygon object using shapely:

    from shapely.geometry import Polygon, MultiPolygon, Point
    def transform_polygons_to_multipolygons(flat_list: list) -> list:
        # Each inner list (ring) becomes its own Polygon inside the MultiPolygon
        return [MultiPolygon([Polygon(coords) for coords in polygon]).wkt for polygon in flat_list]
    

    You will notice that after creating each MultiPolygon object I save it in WKT format (i.e., as a string).

    Finally, I compute the area of the multipolygon

    from shapely import wkt
    def compute_geo_areas(multipolygon_wkt: str) -> float:
        # The input is a WKT string, not a MultiPolygon instance
        return wkt.loads(multipolygon_wkt).area
    

    The final code

    flattened_lists: list = [flatten_list_of_lists(row) for row in df_geo["coordinates"].to_list()]
    multipolygons: list = transform_polygons_to_multipolygons(flattened_lists)

    df_geo = (
        df_geo.with_columns(pl.Series("multipolygons", multipolygons))
        .with_columns(polygon_area=pl.col("multipolygons").map_elements(compute_geo_areas))
        .drop("coordinates")
    )
    

    Here df_geo is the Polars DataFrame loaded from the JSON file attached to the question.
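    The fix can also be seen purely at the string level: each ring must land in its own ((...)) group inside the MULTIPOLYGON instead of being merged into one. A hedged pure-Python sketch of that grouping (toy data; the shapely route above is the simpler option in practice):

```python
def rings_to_multipolygon_wkt(rings):
    # Wrap each ring in its own ((...)) polygon group, then join the groups
    polys = ", ".join(
        "((" + ", ".join(f"{lon} {lat}" for lon, lat in ring) + "))"
        for ring in rings
    )
    return f"MULTIPOLYGON ({polys})"

rings = [
    [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]],
    [[2.0, 2.0], [3.0, 2.0], [3.0, 3.0], [2.0, 2.0]],
]
print(rings_to_multipolygon_wkt(rings))
# MULTIPOLYGON (((0.0 0.0, 1.0 0.0, 1.0 1.0, 0.0 0.0)), ((2.0 2.0, 3.0 2.0, 3.0 3.0, 2.0 2.0)))
```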