I'm trying to write a function that takes a netCDF file (which has NaNs or a specific value in it) and masks another netCDF file (that is mostly full, but has some NaNs where there is no data). It's geographic data, so I'm trying to mask a region.
So the result I expect is a map with only the region of interest shown, but what I get is a the full data of the file I tried to mask, as if the masking did nothing. I don't get any errors when executing the code. Here is what I have tried:
def mask_data(ifile, mask_folder,data_masked):
# data_masked = '/file/path/to/save/outout/'
# ifile = netcdf file
# mask_folder = folder with various netcdf files, all of which are independent masks and loop over the ifile, to generate different maps with different selected regions
ds = xr.open_dataset(ifile, engine='netcdf4')
for mask_file in os.listdir(mask_folder):
mask_path = os.path.join(mask_folder, mask_file)
output_file = masked+'masked_data'+mask_file+'.nc'
ds_mask = xr.open_dataset(mask_path)
# apply the mask to the ifile data
ds_masked = ds['var'].where(ds_mask)
ds_ghf_masked.to_netcdf(output_file)
ds_mask.close() # remember to close the opened datasets!!
ds_ghf.close()
return
I am at a complete loss as to why, when I call mask_data(), I simply get the ifile saved in the directory masked_data 20 times (the amount of separate mask files I have). I thought maybe they are layering on top of one another, but that would imply only the last file would be the full ifile, but all of them are.
What can I try next?
I see a few issues in your code that are causing problems.
First I'm not sure based on your post what values in ds_mask
represent the values you want to keep. Is it where the mask equals 1? or where the mask is not NaN?
Second, the variable you are saving out with .to_netcdf()
is not the one you just created with the masking. The variable you created is ds_masked
and the one you are saving out is ds_ghf_masked
so that will be why the saved netcdf files are all the same.
Assuming you want to keep data where the mask equals 1 and drop the rest, and assuming that the name of the variable in ds_mask
is mask
, you want to do:
def mask_data(ifile, mask_folder,data_masked):
# data_masked = '/file/path/to/save/outout/'
# ifile = netcdf file
# mask_folder = folder with various netcdf files, all of which are independent masks and loop over the ifile, to generate different maps with different selected regions
ds = xr.open_dataset(ifile, engine='netcdf4')
for mask_file in os.listdir(mask_folder):
mask_path = os.path.join(mask_folder, mask_file)
output_file = masked+'masked_data'+mask_file+'.nc'
ds_mask = xr.open_dataset(mask_path)
# apply the mask to the ifile data
# this will give a DataArray as output
da_masked = ds['var'].where(ds_mask['mask']==1, drop=True)
# Convert the DataArray to a Dataset for saving
ds_masked = da_masked.to_dataset()
# Save Dataset out to netcdf
ds_masked.to_netcdf(output_file)
return
To explain, xr.Dataset.where(cond)
keeps the data where cond
is True. Then if drop=False
(the default), it sets the values where cond
is False to NaN, or if drop=True
, it gets rid of the values where cond
is False. Hope that helps!