TLDR: I'm trying to combine several rows of a GeoPandas GeoDataFrame into a single row whose geometry is the union of their shapes.
I'm currently working on a little project that requires me to create interactive choropleth plots of Canadian health regions using a few different metrics.
I had merged two Dataframes, one containing population estimates by year for each health region, and another GeoDataframe containing the geometry for the health regions, when I noticed that the number of rows wasn't the same.
Upon further inspection, I realized the two datasets I had been using didn't include the exact same health regions. The shape-files I got had a few more health regions than the population data, which had amalgamated a few of them for methodological reasons.
After noticing the difference, I redid the merge and compared the regions on each side so I could figure out what I needed to roll up.
merged_gdf = gdf.merge(df, on='HR_UID')
#HR_UID is just the name of the column with the codes for the health regions, since they
#have slightly different names in different datasets, it's easier to merge on code.
print(list(set(df['HEALTH_REGION'])-set(merged_gdf['HEALTH_REGION_y'])),list(set(gdf['HR_UID'])-set(df['HR_UID'].unique())))
This showed that the missing health region was ['Mamawetan/Keewatin/Athabasca, Saskatchewan']. The GeoDataframe has those three regions separate, with codes 4711, 4712, 4713, while the population data has them rolled up into one region with code 4714.
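As a side note, the same check can be done in one step with an outer merge and pandas' `indicator=True` option. This is only a minimal sketch on toy stand-in tables: `gdf_codes`, `pop_codes` and the code '4710' are made up for illustration, and only codes 4711 to 4714 come from my data.

```python
import pandas as pd

# Toy stand-ins for the two tables. '4710' is a hypothetical region present
# in both; 4711-4714 are the codes discussed above.
gdf_codes = pd.DataFrame({'HR_UID': ['4710', '4711', '4712', '4713']})
pop_codes = pd.DataFrame({'HR_UID': ['4710', '4714']})

# An outer merge keeps unmatched rows from both sides, and indicator=True
# adds a '_merge' column recording which side each row came from.
check = gdf_codes.merge(pop_codes, on='HR_UID', how='outer', indicator=True)

only_in_shapes = check.loc[check['_merge'] == 'left_only', 'HR_UID'].tolist()
only_in_population = check.loc[check['_merge'] == 'right_only', 'HR_UID'].tolist()
print(only_in_shapes, only_in_population)
```

Rows marked 'left_only' exist only in the shape file, and rows marked 'right_only' exist only in the population data, so both directions of the mismatch show up in a single table.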
I want to merge the rows of my GeoDataframe that correspond to the health regions rolled up in the population data, so that their polygons become a single shape. I went back to the GeoDataframe and selected the three rows for those regions:
old_hr=gdf[gdf['HR_UID'].isin({'4711','4712','4713'})]
HR_UID HEALTH_REGION SHAPE_AREA \
31 4711 Mamawetan Churchill River Regional Health Auth... 1.282120e+11
32 4712 Keewatin Yatthé Regional Health Authority 1.095536e+11
33 4713 Athabasca Health Authority 5.657720e+10
SHAPE_LEN geometry
31 1.707619e+06 POLYGON ((5602074.666 2364598.029, 5591985.366...
32 1.616297e+06 POLYGON ((5212469.723 2642030.691, 5273110.000...
33 1.142962e+06 POLYGON ((5248633.914 2767057.263, 5249285.640...
Now I've come to the realization that I'm not exactly sure how to combine polygons in a GeoDataframe. I have tried using dissolve(on='HEALTH_REGION'), although that didn't work. I've spent a while looking around online, but so far I can't find anyone asking this particular question. Perhaps I'm missing something.
Turns out it was actually simpler than I had imagined, and I was just confused about some additional columns in the dataframe that weren't actually necessary for the mapping. I'm new to Geopandas and mapping in general, so I hadn't realized the SHAPE_AREA and SHAPE_LEN columns weren't actually needed.
Here was the code I used to import the dataframe without the extra columns and then combine the 3 polygons:
# if this is not "pythonic" let me know, I'm still a python rookie, but this
# worked for me.
import geopandas as gpd
import pandas as pd

gdf = gpd.read_file('data/HR_Boundary_Files/HR_000b18a_e.shp', encoding='utf-8').drop(columns=['FRENAME', 'SHAPE_AREA', 'SHAPE_LEN'])
gdf = gdf.rename(columns={'ENGNAME': 'HEALTH_REGION'})
merged_codes = ['4711', '4712', '4713']
old_hr = gdf[gdf['HR_UID'].isin(merged_codes)]
gdf = gdf[~gdf['HR_UID'].isin(merged_codes)]
new_region_geometry = old_hr['geometry'].unary_union
# DataFrame.append was removed in pandas 2.0, so build a one-row
# GeoDataFrame for the new region and concatenate it instead.
new_row = gpd.GeoDataFrame({'HR_UID': ['4714'], 'HEALTH_REGION': ['Mamawetan/Keewatin/Athabasca Health Region']},
                           geometry=[new_region_geometry], crs=gdf.crs)
gdf = pd.concat([gdf, new_row], ignore_index=True)
The unary_union property of GeoSeries returns the union of all the geometries, which gave me the new shape I needed. I just added that to the dataframe with the correct region name and code, and dropped the old regions that made up the new one. (GeoPandas 1.0 and later deprecate unary_union in favour of the union_all() method, which returns the same thing.)
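For what it's worth, dissolve can probably do the same job. Its keyword is `by` rather than `on`, and the rows to be combined need to share a value in that column first. Here is a rough sketch on toy data, not my real shape file: the unit squares, the `toy` frame, the `rollup` mapping, the short region names and the '4710' row are all made up for illustration.

```python
import geopandas as gpd
from shapely.geometry import box

# Toy data: three adjacent unit squares stand in for regions 4711-4713,
# plus one separate square for a hypothetical unrelated region 4710.
toy = gpd.GeoDataFrame(
    {
        'HR_UID': ['4710', '4711', '4712', '4713'],
        'HEALTH_REGION': ['Other', 'Mamawetan', 'Keewatin', 'Athabasca'],
    },
    geometry=[box(5, 0, 6, 1), box(0, 0, 1, 1), box(1, 0, 2, 1), box(2, 0, 3, 1)],
)

# Give the rows to be combined the same code; unmapped codes are left alone.
rollup = {'4711': '4714', '4712': '4714', '4713': '4714'}
toy['HR_UID'] = toy['HR_UID'].replace(rollup)
toy.loc[toy['HR_UID'] == '4714', 'HEALTH_REGION'] = 'Mamawetan/Keewatin/Athabasca Health Region'

# dissolve groups rows by the `by` column and unions the geometry per group;
# aggfunc='first' keeps the first value of every other column in each group.
dissolved = toy.dissolve(by='HR_UID', aggfunc='first').reset_index()
print(dissolved[['HR_UID', 'HEALTH_REGION']])
```

The upside of this route is that several roll-ups can be handled at once by adding entries to the mapping, instead of repeating the select, union and concat steps for each new region.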