Search code examples
pythonnetcdfpython-xarray

Using xarray where() to create a mask from a netCDF file


I'm trying to write a function that takes a netCDF file (which has NaNs or a specific value in it) and masks another netCDF file (that is mostly full, but has some NaNs where there is no data). It's geographic data, so I'm trying to mask a region.

So the result I expect is a map with only the region of interest shown, but what I get is a the full data of the file I tried to mask, as if the masking did nothing. I don't get any errors when executing the code. Here is what I have tried:

  • I've checked the directories of all the files in the function; they exist, and I have read and write permission (I am using a local network and my machine)
  • generally toyed around with what variable I am selecting etc
def mask_data(ifile, mask_folder,data_masked):
# data_masked = '/file/path/to/save/outout/'
# ifile = netcdf file
# mask_folder = folder with various netcdf files, all of which are independent masks and loop over the ifile, to generate different maps with different selected regions
    ds = xr.open_dataset(ifile, engine='netcdf4')
    
    for mask_file in os.listdir(mask_folder):
        mask_path = os.path.join(mask_folder, mask_file)
        output_file = masked+'masked_data'+mask_file+'.nc'
        ds_mask = xr.open_dataset(mask_path)

        # apply the mask to the ifile data
        ds_masked = ds['var'].where(ds_mask)

        ds_ghf_masked.to_netcdf(output_file)

        ds_mask.close() # remember to close the opened datasets!!
        
    ds_ghf.close()
    return

I am at a complete loss as to why, when I call mask_data(), I simply get the ifile saved in the directory masked_data 20 times (the amount of separate mask files I have). I thought maybe they are layering on top of one another, but that would imply only the last file would be the full ifile, but all of them are.

What can I try next?


Solution

  • I see a few issues in your code that are causing problems.

    • First I'm not sure based on your post what values in ds_mask represent the values you want to keep. Is it where the mask equals 1? or where the mask is not NaN?

    • Second, the variable you are saving out with .to_netcdf() is not the one you just created with the masking. The variable you created is ds_masked and the one you are saving out is ds_ghf_masked so that will be why the saved netcdf files are all the same.

    • Assuming you want to keep data where the mask equals 1 and drop the rest, and assuming that the name of the variable in ds_mask is mask, you want to do:

    def mask_data(ifile, mask_folder,data_masked):
    # data_masked = '/file/path/to/save/outout/'
    # ifile = netcdf file
    # mask_folder = folder with various netcdf files, all of which are independent masks and loop over the ifile, to generate different maps with different selected regions
        ds = xr.open_dataset(ifile, engine='netcdf4')
        
        for mask_file in os.listdir(mask_folder):
            mask_path = os.path.join(mask_folder, mask_file)
            output_file = masked+'masked_data'+mask_file+'.nc'
            ds_mask = xr.open_dataset(mask_path)
    
            # apply the mask to the ifile data
            # this will give a DataArray as output
            da_masked = ds['var'].where(ds_mask['mask']==1, drop=True)
    
            # Convert the DataArray to a Dataset for saving
            ds_masked = da_masked.to_dataset()
    
            # Save Dataset out to netcdf
            ds_masked.to_netcdf(output_file)
    
        return
    

    To explain, xr.Dataset.where(cond) keeps the data where cond is True. Then if drop=False (the default), it sets the values where cond is False to NaN, or if drop=True, it gets rid of the values where cond is False. Hope that helps!