I have two rather simple dataframes/datasets covering the whole world in 0.5 degrees resolution.
Vegetation categories data
biom_df
ind lat lon biome
0 83.25 -43.25 0
1 83.25 -42.75 1
2 83.25 -42.25 1
3 83.25 -39.75 0
4 83.25 -38.75 2
and
Population data
pop_df
ind lat lon pop2020
0 83.6875 -33.5625 0
1 83.6875 -33.4375 0
2 83.6875 -33.3125 0
3 83.6875 -33.1875 0
4 83.6875 -33.0625 0
The goal is to obtain the total population sum in each biome category.
As you can see, the lat and lon coordinates of the two datasets are slightly different, and I haven't figured out how to match them by the "nearest" or "overlapping" grid cells.
I don't care if it is solved with numpy/pandas or xarray.
Thank you for your suggestions!
I tried:
biom_ds = biom_df.set_index(['lat','lon']).to_xarray()
pop_ds = pop_df.set_index(['lat','lon']).to_xarray()
pop_ds.sel(lat = biom_ds.loc[ind].lat,
lon = biom_ds.loc[ind].lon,
method='nearest')
but don't know how to insert that value now into the other dataset.
And I tried:
test = pop_df.merge(biom_df, how='outer', on=['lat','lon'])
test = test.sort_values(by=['lat','lon']).reset_index(drop=True)
test['biome'] = test.biome.ffill()
test = test[ (test.lat.isin(pop_df.lat.values)) & (test.lon.isin(pop_df.lat.values)) ]
But that seems to be a wrong way to do it, as the result plot shows (see picture)
If it interpreted your question right, you want to for each of the rows in pop_df
find the nearest biome number from biom_df
and get the sum of population grouped by biome number.
Here's one way to achieve that:
from scipy.spatial.distance import cdist
pop_df["nearest_biome_index"] = cdist(pop_df[["lat", "lon"]].to_numpy(), biom_df[["lat", "lon"]].to_numpy(), metric="euclidean").argmin(axis=1)
print(pop_df.join(biom_df[["biome"]], on="nearest_biome_index").groupby("biome")[["pop2020"]].sum())
This would with cdist
calculate the (euclidean) pairwise distances between each coordinate in pop_df
and biom_df
, then find the index of biom_df
where the distance is the smallest (this code assumes that indices start from 0 and increment by 1).
Then you can join the biome numbers from biome_df
to pop_df
based on the indices of the rows that you got with argmin
, and now you have the number of the nearest biome for each row.
After that it's just a matter of using group_by
and sum
to get the sum of population for each biome group.