python pandas python-xarray netcdf era5

How to Extract timeseries from gridded lat lon data (ERA5) in Python?


I have downloaded ERA5 global data which contains the following params:

  1. total precipitation
  2. snowfall
  3. cloud cover
  4. temperature at 2m

The data range is from Jan 2008 to Dec 2024. I am using the xarray package in Python to work with NetCDF files, but this time I've been scratching my head looking at the data.

The issue is that the data is gridded by latitude: for every latitude value there are multiple longitude values (-180 to +180), so the dates repeat for each (lat, lon) pair.

Is there any way to extract time series for the Pakistan region, keeping the latitude and longitude?

This is what the data look like

[image: screenshot of the dataset]

I am using the .sel() method from xarray, but I am drawing a blank and have no idea what to do in this kind of scenario. I don't want to lose data points, as I am trying to build a forecasting model.


Solution

  • First of all, you need to preprocess your dataset a little. Look at the coordinates: the latitude starts at 90° and ends at -90°. Suppose you want to select latitudes from 23.5° to 37° (which I think corresponds to Pakistan). If you try .sel(latitude=slice(23.5, 37)), you will get an empty result (a latitude dimension of length 0), because the latitude index is stored in descending order, from 90° to -90°.

    You first need to sort the latitude dimension:

    data = data.sortby(data.latitude)
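
To see why the sort matters, here is a minimal sketch on a made-up dataset (the name `demo` and the zero-valued `tp` variable are placeholders, not part of the original data). It assumes only that latitude is stored from north to south, as described above.

```python
import numpy as np
import xarray as xr

# Toy dataset whose latitude runs north to south (values are made up)
lat = np.arange(90.0, -90.5, -0.5)
demo = xr.Dataset(
    {"tp": ("latitude", np.zeros(lat.size))},
    coords={"latitude": lat},
)

# An ascending slice on a descending index selects nothing
n_unsorted = demo.sel(latitude=slice(23.5, 37.0)).sizes["latitude"]

# After sorting, the same slice returns the expected grid rows
demo_sorted = demo.sortby("latitude")
n_sorted = demo_sorted.sel(latitude=slice(23.5, 37.0)).sizes["latitude"]

# Alternative without sorting: give the slice bounds in descending order
n_reversed = demo.sel(latitude=slice(37.0, 23.5)).sizes["latitude"]

print(n_unsorted, n_sorted, n_reversed)
```

Sorting once up front is usually the less error-prone choice, because every later slice can then be written in the natural south-to-north order.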
    

    Also, your longitude runs from 0° to 360°, so I recommend converting it to -180° to 180° using:

    data = data.assign_coords(longitude=(((data.longitude + 180) % 360) - 180))
    data = data.sortby(data.longitude)
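
As a quick check of what this conversion does, the sketch below applies the same expression to a made-up 30° grid (the `demo` dataset and its `tp` values are illustrative only) and confirms that each value stays attached to its original meridian.

```python
import numpy as np
import xarray as xr

# Toy dataset on a 0..360 longitude grid; tp simply stores the original position
lon = np.arange(0.0, 360.0, 30.0)  # 0, 30, ..., 330
demo = xr.Dataset(
    {"tp": ("longitude", np.arange(lon.size, dtype=float))},
    coords={"longitude": lon},
)

# Same expression as above: wrap to -180..180, then restore ascending order
wrapped = demo.assign_coords(longitude=(((demo.longitude + 180) % 360) - 180))
wrapped = wrapped.sortby("longitude")

lon_values = wrapped.longitude.values.tolist()
value_at_minus_180 = float(wrapped.tp.sel(longitude=-180.0))  # was stored at 180
value_at_60 = float(wrapped.tp.sel(longitude=60.0))           # unchanged position

print(lon_values)
```

Note that longitudes between 0° and 180°E are identical in both conventions, so for a region such as 60.5°E to 77.5°E the conversion does not change which grid points are selected; it mainly matters for regions west of Greenwich.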
    

    I think Pakistan lies between 23.5°N and 37.0°N and between 60.5°E and 77.5°E, but adjust the following code to the region you are interested in. To obtain the time series you want:

    # Select region
    data_pak = data.sel(latitude=slice(23.5, 37.0), longitude=slice(60.5, 77.5))
    
    # Transform to dataframe: one column per variable and grid point
    # to_dataframe() indexes rows by (time, latitude, longitude);
    # unstacking the two spatial levels moves them into the columns
    df = data_pak.to_dataframe()
    df_unstacked = df.unstack(['latitude', 'longitude'])
    df_unstacked.columns = [f'{var}_{lat}_{lon}' for var, lat, lon in df_unstacked.columns]
    df_result = df_unstacked.reset_index()
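
The region selection and reshape can be tried end to end on synthetic data before running it on the real file. The sketch below is only an illustration: the 1° grid, the three monthly time steps, the random values, and the variable names `tp` and `sf` are assumptions. It unstacks the `latitude` and `longitude` index levels of the dataframe directly to get one column per variable and grid point.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Made-up 1-degree grid around the region, with latitude stored north to south
time = pd.date_range("2008-01-01", periods=3, freq="MS")
lat = np.arange(40.0, 19.0, -1.0)  # 40, 39, ..., 20
lon = np.arange(55.0, 81.0, 1.0)   # 55, 56, ..., 80
rng = np.random.default_rng(0)
shape = (time.size, lat.size, lon.size)
dims = ("time", "latitude", "longitude")
demo = xr.Dataset(
    {"tp": (dims, rng.random(shape)), "sf": (dims, rng.random(shape))},
    coords={"time": time, "latitude": lat, "longitude": lon},
)

# Sort latitude so that an ascending slice works, then cut out the region
demo = demo.sortby("latitude")
region = demo.sel(latitude=slice(23.5, 37.0), longitude=slice(60.5, 77.5))

# Rows are indexed by (time, latitude, longitude); move the grid into the columns
wide = region.to_dataframe().unstack(["latitude", "longitude"])
wide.columns = [f"{var}_{la}_{lo}" for var, la, lo in wide.columns]
wide = wide.reset_index()

print(wide.shape)
```

Each row of `wide` is one time step, and no grid point is dropped: 14 latitudes times 17 longitudes times 2 variables gives 476 value columns plus the `time` column.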
    

    df_result then holds one row per time step and one column per variable and grid point. If this doesn't work, you can try a more "manual" way:

    # Select region
    data_pak = data.sel(latitude=slice(23.5, 37.0), longitude=slice(60.5, 77.5))
    
    # Transform to dataframe and change dimensions
    # reset_index() turns time, latitude and longitude into regular columns
    df = data_pak.to_dataframe().reset_index()
    df['lat_lon'] = "lat=" + df['latitude'].astype(str) + "lon=" + df['longitude'].astype(str)
    # value_vars must list columns that exist in your dataset
    df_melted = df.melt(id_vars=['time', 'lat_lon'], value_vars=['expver', 'tp', 'e', 'sf'], var_name='variable', value_name='value')
    df_melted['new_col'] = df_melted['variable'] + "-" + df_melted['lat_lon']
    df_result = df_melted.pivot(index='time', columns='new_col', values='value')
    df_result.reset_index(inplace=True)