Search code examples
pythonspatial

Matching two pandas dataframes / xarray datasets with slightly different coordinates?


I have two rather simple dataframes/datasets covering the whole world in 0.5 degrees resolution.

Vegetation categories data

biom_df
ind      lat     lon  biome
0      83.25  -43.25      0
1      83.25  -42.75      1
2      83.25  -42.25      1
3      83.25  -39.75      0
4      83.25  -38.75      2

and

Population data

pop_df
ind         lat      lon  pop2020
0       83.6875 -33.5625        0
1       83.6875 -33.4375        0
2       83.6875 -33.3125        0
3       83.6875 -33.1875        0 
4       83.6875 -33.0625        0   

The goal is to obtain the total population sum in each biome category.

As you can see, the lat and lon coordinates of the two datasets are slightly different, and I haven't figured out how to match them by the "nearest" or "overlapping" grid cells.

I don't care if it is solved with numpy/pandas or xarray.

Thank you for your suggestions!

I tried:

biom_ds = biom_df.set_index(['lat','lon']).to_xarray()
pop_ds  = pop_df.set_index(['lat','lon']).to_xarray()
pop_ds.sel(lat = biom_ds.loc[ind].lat, 
           lon = biom_ds.loc[ind].lon, 
           method='nearest')

but don't know how to insert that value now into the other dataset.

And I tried:

test = pop_df.merge(biom_df, how='outer', on=['lat','lon'])
test = test.sort_values(by=['lat','lon']).reset_index(drop=True)
test['biome'] = test.biome.ffill()
test = test[ (test.lat.isin(pop_df.lat.values)) & (test.lon.isin(pop_df.lat.values)) ]

But that seems to be a wrong way to do it, as the result plot shows (see picture)

dataframe_merge_test_result


Solution

  • If it interpreted your question right, you want to for each of the rows in pop_df find the nearest biome number from biom_df and get the sum of population grouped by biome number.

    Here's one way to achieve that:

    from scipy.spatial.distance import cdist
    
    pop_df["nearest_biome_index"] = cdist(pop_df[["lat", "lon"]].to_numpy(), biom_df[["lat", "lon"]].to_numpy(), metric="euclidean").argmin(axis=1)
    
    print(pop_df.join(biom_df[["biome"]], on="nearest_biome_index").groupby("biome")[["pop2020"]].sum())
    

    This would with cdist calculate the (euclidean) pairwise distances between each coordinate in pop_df and biom_df, then find the index of biom_df where the distance is the smallest (this code assumes that indices start from 0 and increment by 1).

    Then you can join the biome numbers from biome_df to pop_df based on the indices of the rows that you got with argmin, and now you have the number of the nearest biome for each row.

    After that it's just a matter of using group_by and sum to get the sum of population for each biome group.