Tags: amazon-web-services, datetime, amazon-s3, julia, netcdf

Time conversion from AWS NetCDF file (noaa-16 bucket)


I've downloaded a dataset ("OR_GLM-L2-LCFA_G16_s20202530000000_e20202530000200_c20202530000226.nc") from the noaa-16 bucket hosted on AWS (Amazon Web Services). The dataset contains a variable titled "flash_time_offset_of_first_event", detailing the number of seconds since 9/9/2020, between 00:00:00 and 00:00:20...

Screenshot of AWS NetCDF file as seen in Panoply

I am reading the netCDF file using:

using NCDatasets
flash_times = []
fname1 = "OR_GLM-L2-LCFA_G16_s20202530000000_e20202530000200_c20202530000226.nc"
fpath1 = string("C:\\mydir\\",fname1)
NCDataset(fpath1) do ds
    [push!(flash_times,string(x)) for x in ds["flash_time_offset_of_first_event"][:,:]]
end

which produces:

sort(flash_times)
481-element Array{Any,1}:
 "2020-09-08T23:59:42.594"
 "2020-09-08T23:59:42.672"
 "2020-09-08T23:59:42.688"
 ⋮
 "2020-09-09T00:00:07.324"
 "2020-09-09T00:00:07.366"
 "2020-09-09T00:00:07.42"

The problem is that the times do not match the times shown in the plot of the values, as plotted in Panoply (shown below). In Panoply, the plotted values range from ~-0.7s (2020-09-08T23:59:59.3) to ~19.4s (2020-09-09T00:00:19.4):

A plot of the flash times, as plotted by Panoply

I'm extracting the earliest and latest times in my array of DateTimes using:

@info("",Earliest_time=sort(flash_times)[1],Latest_time=sort(flash_times)[end])

which produces:

┌ Info: 
│   Earliest_time = "2020-09-08T23:59:42.594"
└   Latest_time = "2020-09-09T00:00:07.42"   

My question: How can I correctly extract these times, or correct the times I have extracted? The NetCDF also contains information for scale_factor and add_offset variables, but I have not been able to implement these so far. I've also tried extracting the netCDF data using the NetCDF package, but this returns an array of integers, which I've tried (unsuccessfully) to convert using the scale_factor and add_offset variables.


Solution

  • NCDatasets already applies the scale_factor and add_offset attributes automatically, but this variable has another attribute, _Unsigned, which NCDatasets does not handle yet, whereas other libraries, such as those used by Panoply and xarray, do. I created this issue for it: https://github.com/Alexander-Barth/NCDatasets.jl/issues/133.

    In short, the data is stored as Int16, so raw values in the upper half of the unsigned range (32768 to 65535) read back as negative numbers. Because these values are meant to be interpreted as unsigned (positive), the scale and offset get applied to the wrong numbers, which shifts the dates that come out in the end. For now, we can dig down and apply the steps ourselves to get the correct values from the raw data:

    using NCDatasets
    using Downloads
    using Dates
    url = "http://ftp.cptec.inpe.br/goes/goes16/glm/2020/09/09/OR_GLM-L2-LCFA_G16_s20202530000000_e20202530000200_c20202530000226.nc"
    
    path = Downloads.download(url)
    ds = NCDataset(path)
    var = ds["flash_time_offset_of_first_event"]
    # flash_time_offset_of_first_event (481)
    #   Datatype:    Int16
    #   Dimensions:  number_of_flashes
    #   Attributes:
    #    long_name            = GLM L2+ Lightning Detection: time of occurrence of first constituent event in flash
    #    standard_name        = time
    #    _Unsigned            = true
    #    scale_factor         = 0.0003814756
    #    add_offset           = -5.0
    #    units                = seconds since 2020-09-09 00:00:00.000
    #    axis                 = T
    
    # see the issue, these dates are incorrect
    extrema(var)
    # DateTime("2020-09-08T23:59:42.594")
    # DateTime("2020-09-09T00:00:07.420")
    
    raw = var.var[:]  # Vector{Int16}
    # interpret as unsigned, and apply scale and offset
    val = reinterpret(unsigned(eltype(raw)), raw) .* NCDatasets.scale_factor(var) .+ NCDatasets.add_offset(var)
    # get dates by adding these as seconds to the epoch
    epoch = DateTime("2020-09-09T00:00:00")
    times = epoch .+ Millisecond.(round.(Int, val .* 1000))
    extrema(times)  # these are as expected
    # DateTime("2020-09-08T23:59:59.251")
    # DateTime("2020-09-09T00:00:19.408")
    

    Of course this is not ideal; it would be easier for users if this were handled in NCDatasets itself.
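
To see the effect of the _Unsigned attribute without opening the file, here is a minimal sketch that uses only the Dates standard library. The helper names `decode_unsigned_seconds` and `to_time` are hypothetical (they are not part of NCDatasets), and the two raw values are illustrative numbers chosen so that the unsigned decoding lands on the extremes reported above; they were not read from the file.

```julia
using Dates

# Hypothetical helper (not part of NCDatasets): reinterpret the Int16 bits as
# UInt16, then apply scale_factor and add_offset to get seconds since the epoch.
function decode_unsigned_seconds(raw::AbstractVector{Int16}, scale, offset)
    unsigned_vals = reinterpret(UInt16, raw)  # same bits, read as 0..65535
    return unsigned_vals .* scale .+ offset
end

scale  = 0.0003814756                    # scale_factor attribute of the variable
offset = -5.0                            # add_offset attribute of the variable
epoch  = DateTime("2020-09-09T00:00:00") # from the units attribute

# Illustrative raw values (an assumption, not read from the file).
# -1553 as Int16 has the same bit pattern as 63983 as UInt16.
raw = Int16[11143, -1553]

# Hypothetical helper: seconds since the epoch -> DateTime, rounded to milliseconds
to_time(s) = epoch + Millisecond(round(Int, s * 1000))

signed_secs   = raw .* scale .+ offset                     # what a signed reading gives
unsigned_secs = decode_unsigned_seconds(raw, scale, offset)

println(to_time.(signed_secs))    # second entry falls before the epoch
println(to_time.(unsigned_secs))  # 2020-09-08T23:59:59.251 and 2020-09-09T00:00:19.408
```

Values that fit in the positive Int16 range decode identically either way; only those at or above 32768 (negative when read as signed) end up shifted by 65536 × scale_factor, which is roughly 25 seconds here.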