As part of a project, I'm exploring satellite data that is available in H5 (HDF5) format. I'm new to this format and haven't been able to process the data. I can open the file in a tool called Panoply, where the DHI value shows up as a Geo2D variable. Is there any way to extract the data into CSV format as shown below?
X | Y | DHI |
---|---|---|
X1 | Y1 | |
X2 | Y2 | |
Attaching screenshots of the file opened in Panoply alongside.
Link to the file: https://drive.google.com/file/d/1xQHNgrlrbyNcb6UyV36xh-7zTfg3f8OQ/view
I tried the following code to read the data. I can store the values as a 2D NumPy array, but I can't work out how to write them out together with their X/Y locations.
```python
import h5py
import numpy as np
import pandas as pd
import geopandas as gpd
#%%
f = h5py.File('mer.h5', 'r')
for key in f.keys():
    print(key)  # names of the root-level objects in the HDF5 file: groups or datasets
    print(type(f[key]))  # object type: usually a group or a dataset
ls = list(f.keys())
key = 'X'
masterdf = pd.DataFrame()
data = f.get(key)
dataset1 = np.array(data)
masterdf = dataset1
np.savetxt("FILENAME.csv", dataset1, delimiter=",")
#masterdf.to_csv('new.csv')
```
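Before extracting anything, it can help to see which datasets the file holds and what shape each one has, since that shows where the coordinates live relative to the 2D values. A minimal sketch, assuming h5py is installed; the file `demo.h5` built here is a synthetic stand-in, the dataset names and shapes are only guesses at a layout like the one described, and `record` and `found` are hypothetical helper names:

```python
import os
import tempfile

import h5py
import numpy as np

# Build a small stand-in file; names and shapes are hypothetical.
path = os.path.join(tempfile.mkdtemp(), "demo.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("X", data=np.arange(3.0))
    f.create_dataset("Y", data=np.arange(2.0))
    f.create_dataset("DHI", data=np.zeros((1, 2, 3)))

# Collect (name, shape, dtype) for every dataset, including nested ones.
found = []

def record(name, obj):
    # visititems calls this for each group and dataset in the file.
    if isinstance(obj, h5py.Dataset):
        found.append((name, obj.shape, str(obj.dtype)))

with h5py.File(path, "r") as f:
    f.visititems(record)

for name, shape, dtype in found:
    print(name, shape, dtype)
```

Pointing `path` at the real file instead of the stand-in should list its actual datasets in the same way.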
I found an effective way to read the data, convert it to a DataFrame, and reproject the X/Y coordinates from EPSG:24378 to EPSG:4326 (longitude/latitude).
Code is tracked here: https://github.com/rishikeshsreehari/boring-stuff-with-python/blob/main/data-from-hdf5-file/final_converter.py
Code is as follows:
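```python
import time

import h5py
import pandas as pd
from pyproj import Transformer

input_epsg = 24378   # CRS of the X/Y values in the file
output_epsg = 4326   # WGS 84 longitude/latitude

start_time = time.time()

with h5py.File("mer.h5", "r") as file:
    # Skip the last two X values and the matching last two DHI columns.
    df_X = pd.DataFrame(file.get("X")[:-2], columns=["X"])
    df_Y = pd.DataFrame(file.get("Y"), columns=["Y"])
    DHI = file.get("DHI")[0][:, :-2].reshape(-1)

# Cross join: Y varies slowest and X fastest, matching the row-major flatten of DHI.
final = df_Y.merge(df_X, how="cross").assign(DHI=DHI)[["X", "Y", "DHI"]]

# Transformer replaces the deprecated pyproj.transform function.
transformer = Transformer.from_crs(input_epsg, output_epsg, always_xy=True)
final["X2"], final["Y2"] = transformer.transform(final["X"].to_numpy(), final["Y"].to_numpy())

#final.to_csv("final_converted1.csv", index=False)
print("--- %s seconds ---" % (time.time() - start_time))
```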
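The pairing of values with coordinates relies on `reshape(-1)` flattening the grid row by row, which lines up with a cross join that has Y on the left and X on the right. A minimal sketch on made-up numbers (the 2×3 grid and its values are hypothetical, not taken from mer.h5), assuming the grid is stored with one row per Y and one column per X:

```python
import numpy as np
import pandas as pd

# Hypothetical coordinates and values, only for illustration.
x = np.array([10.0, 20.0, 30.0])      # one entry per column
y = np.array([1.0, 2.0])              # one entry per row
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])          # shape (len(y), len(x))

# Cross join with Y on the left: Y varies slowest, X fastest,
# which is the same order as a row-major flatten of the grid.
df = (
    pd.DataFrame({"Y": y})
    .merge(pd.DataFrame({"X": x}), how="cross")
    .assign(DHI=grid.reshape(-1))[["X", "Y", "DHI"]]
)

# Equivalent construction with NumPy only.
xx, yy = np.meshgrid(x, y)            # both have shape (len(y), len(x))
alt = pd.DataFrame({"X": xx.ravel(), "Y": yy.ravel(), "DHI": grid.ravel()})

print(df)
```

If the grid were stored the other way round (rows = X, columns = Y), the join order would need to be swapped, so it is worth checking a few known points against Panoply before trusting the CSV.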