Search code examples
pythongeopandasproj

How to remove shapes that cause a problem upon reprojection from a geopandas dataframe?


In this geopandas example, the Antarctica land mass is dropped from the GeoDataFrame before reprojecting to the Mercator projection, in order to prevent problem with shapes containing a pole (which would become infinitely large).

I was wondering, if it is possible to find a more robust reprojection method, so that the dataframe does not need to be manually adjusted. Especially since I'm working with a dataset that does not have a separate row for Antarctica:

enter image description here

I have two ideas:

1. Use information about the destination crs

On the epsg website for example, the 'Area of use' is shown:

enter image description here

We could use it to prepare the data before the reprojection: drop any shapes that extend further South than -80 deg, or alternatively intersect it with a shapely Polygon that describes the area of use of the destination crs, in terms of the source crs - in this case the standard epsg:4326 so Polygon([(-180,-80), (-180,84), ...]).

Issue with this approach: I'm not sure, if this area of use information is programmatically accessible from somewhere for any crs, e.g. from the GeoDataFrame object.

2. Fix in post

Just do it, and pick out the erroneously reprojcted parts later. In my current case, for example, the reprojected geodataframe gdf_merc = gdf.to_crs(epsg=3395) does have errors...

enter image description here

...but by searching for the inf word in the string representation of the geometry, I can find the offending Polygon within the MultiPolygon...

In [360]: for i, polygon in enumerate(gdf_merc.geometry[0]):
   ...:     if 'inf' in str(polygon): 
   ...:         print(i)
  
0

...and just remove it:

enter image description here

Issue with this approach: seems elaborate and I'd prefer to prevent any problems from appearing in the first place.

Any thoughts on how to fix either of these methods, or is there a third way?

One remark: I'm interested in the general case, where any crs could be reprojected to, so I don't want to preemptively remove Antarctica ("just in case"), as other projections might be perfectly fine with it, and, more importantly, they might have other problem areas.

Many thanks!


Solution

  • The option 1 is likely the best shot in here. Latest GeoPandas uses pyproj.CRS to store CRS data, from which you can easily extract bounds of the projection.

    To extract it from df:

    import geopandas as gpd
    df = gpd.read_file(gpd.datasets.get_path('nybb'))
    df.crs.area_of_use.bounds
    

    To get it from target CRS using pyproj directly:

    import pyproj
    crs = pyproj.CRS.from_epsg(3395)
    crs.area_of_use.bounds
    

    Then you can use built-in geopandas.clip to clip your data.

    from shapely.geometry import box
    
    df = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
    crs = pyproj.CRS.from_epsg(3395)
    bounds = crs.area_of_use.bounds
    clipped = gpd.clip(df, box(*bounds))
    clipped.to_crs(crs).plot()
    

    result