
How to read an H5 file containing satellite data in Python?


As part of a project I'm exploring satellite data that is available in H5 format. I'm new to this format and have been unable to process the data. I can open the file in a program called Panoply, where the DHI value is shown as a Geo2D variable. Is there any way to extract the data into CSV format, as shown below:

X Y DHI
X1 Y1
X2 Y2

Attaching screenshots of the file opened in Panoply alongside.

Link to the file: https://drive.google.com/file/d/1xQHNgrlrbyNcb6UyV36xh-7zTfg3f8OQ/view

I tried the following code to read the data. I can store it as a 2D NumPy array, but I can't store it together with the location (the X and Y coordinates).

`

import h5py
import numpy as np
import pandas as pd
import geopandas as gpd


#%%
f = h5py.File('mer.h5', 'r')

# Print the name and type of each root-level object (a group or a dataset).
for key in f.keys():
    print(key)
    print(type(f[key]))

key = 'X'

# Read the selected dataset into a NumPy array and write it to CSV.
data = f.get(key)
dataset1 = np.array(data)

np.savetxt("FILENAME.csv", dataset1, delimiter=",")

`
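
Before combining the values with their coordinates, it can help to see how the datasets' shapes relate to each other. The following is a minimal sketch, not taken from the original post: `describe_datasets` is a hypothetical helper, and the in-memory file with datasets `X`, `Y` and `DHI` is a made-up stand-in, since the real layout of `mer.h5` may differ.

```python
import h5py
import numpy as np


def describe_datasets(h5file):
    """Return {dataset path: (shape, dtype)} for every dataset in the file."""
    found = {}

    def visit(name, obj):
        # visititems walks groups and datasets; keep only the datasets.
        if isinstance(obj, h5py.Dataset):
            found[name] = (obj.shape, str(obj.dtype))

    h5file.visititems(visit)
    return found


# Stand-in file held in memory; the names and shapes are assumptions.
with h5py.File("demo.h5", "w", driver="core", backing_store=False) as demo:
    demo.create_dataset("X", data=np.arange(5, dtype="f8"))
    demo.create_dataset("Y", data=np.arange(3, dtype="f8"))
    demo.create_dataset("DHI", data=np.zeros((1, 3, 5), dtype="f4"))
    summary = describe_datasets(demo)

for name, (shape, dtype) in summary.items():
    print(name, shape, dtype)
```

If the 2D variable has shape `(len(Y), len(X))`, possibly with a leading axis of length 1 as in this stand-in, then each grid cell can be matched to one X value and one Y value.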


Solution

  • I found an effective way to read the data, convert it to a DataFrame, and reproject the X/Y coordinates to longitude/latitude.

    Code is tracked here: https://github.com/rishikeshsreehari/boring-stuff-with-python/blob/main/data-from-hdf5-file/final_converter.py

    Code is as follows:

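    import time

    import h5py
    import pandas as pd
    from pyproj import Transformer

    # EPSG code of the file's X/Y grid and of the target CRS (WGS 84).
    input_epsg = 24378
    output_epsg = 4326

    start_time = time.time()

    with h5py.File("mer.h5", "r") as file:
        # The last two X values and the matching DHI columns are dropped.
        df_X = pd.DataFrame(file["X"][:-2], columns=["X"])
        df_Y = pd.DataFrame(file["Y"][:], columns=["Y"])
        # Take the first 2D slice of DHI and flatten it row by row.
        DHI = file["DHI"][0][:, :-2].reshape(-1)

    # The cross join pairs every Y with every X, in the same row-major order as DHI.
    final = df_Y.merge(df_X, how="cross").assign(DHI=DHI)[["X", "Y", "DHI"]]

    # pyproj.transform is deprecated (since pyproj 2.6.1); Transformer replaces it.
    transformer = Transformer.from_crs(input_epsg, output_epsg, always_xy=True)
    final["X2"], final["Y2"] = transformer.transform(
        final["X"].to_numpy(), final["Y"].to_numpy()
    )

    # final.to_csv("final_converted1.csv", index=False)

    print("--- %s seconds ---" % (time.time() - start_time))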
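
The cross join above relies on the flattened DHI values being in the same order as the (Y, X) pairs. As a rough illustration of that ordering, here is a sketch that builds the same long table with `numpy.meshgrid` instead of a merge. It is not the author's method: `grid_to_long` is a hypothetical helper, and the small arrays are made-up values rather than data from `mer.h5`.

```python
import numpy as np
import pandas as pd


def grid_to_long(x, y, values):
    """Flatten a grid of shape (len(y), len(x)) into rows of X, Y, DHI."""
    x = np.asarray(x)
    y = np.asarray(y)
    values = np.asarray(values)
    if values.shape != (y.size, x.size):
        raise ValueError(
            f"expected shape {(y.size, x.size)}, got {values.shape}"
        )
    # With the default 'xy' indexing, xx and yy both have shape (len(y), len(x)).
    xx, yy = np.meshgrid(x, y)
    return pd.DataFrame(
        {"X": xx.ravel(), "Y": yy.ravel(), "DHI": values.ravel()}
    )


# Made-up 2 x 3 grid: two Y values, three X values.
x = np.array([10.0, 20.0, 30.0])
y = np.array([1.0, 2.0])
dhi = np.array([[0.1, 0.2, 0.3],
                [0.4, 0.5, 0.6]])

long_df = grid_to_long(x, y, dhi)
print(long_df)
```

The shape check makes a mismatch between the coordinate vectors and the grid fail loudly, which is the kind of problem that trimming with `[:-2]` works around in the code above.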