Tags: python, simulation, mean, hdf5, h5py

Compute mean and standard deviation for HDF5 data


I am currently running 100 simulations, each of which computes 1M values (i.e. one value per episode/iteration).

Main Routine

My main file looks like this:

# Defining the test simulation environment
def test_simulation():
    environment = Environment(
        periods=1000000,
        parameter_x=...,
        parameter_y=...,
    )

    # Running the simulation
    environment.simulation()

# Run the simulation 100 times
for i in range(100):
    print(f'--- Iteration {i} ---')
    test_simulation()

The simulation procedure is as follows: within simulation() I generate a value_history that is continuously appended to:

def simulation(self):
    for episode in range(self.periods):
        value = doSomething()
        self.value_history.append(value)

As a result, each episode/iteration produces one value that is an array, e.g. [1.4 1.9] (player 1 has 1.4 and player 2 has 1.9 in the current episode/iteration).
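
For concreteness, a tiny sketch (with made-up numbers) of the resulting structure:

import numpy as np

# Hypothetical value_history after 3 episodes with 2 players
value_history = [np.array([1.4, 1.9]), np.array([1.6, 1.5]), np.array([1.9, 1.7])]
print(np.array(value_history).shape)  # (3, 2): episodes x players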

Storing of Simulation Data

To store the data, I use the approach proposed in Append simulation data using HDF5, which works perfectly fine.

After running the simulations, I receive the following Keys structure:

Keys: <KeysViewHDF5 ['data_000', 'data_001', 'data_002', ..., 'data_099']>

Computing Statistics for Files

Now the goal is to compute averages and standard deviations for each value across the 100 datasets. In the end, I would have a final_data set consisting of 1M averages and 1M standard deviations: one average and one standard deviation per row (per player), taken across the 100 simulations.

The goal would thus be to get something like the following structure [average_player1, average_player2], [std_player1, std_player2]:

episode == 1: [1.5, 1.5], [0.1, 0.2]
episode == 2: [1.4, 1.6], [0.2, 0.3]
...
episode == 1000000: [1.7, 1.6], [0.1, 0.3] 

I currently use the following code to extract the data and store it in an (initially empty) list:

def ExtractSimData(name, simulation_runs, length):
    # Create empty list
    result = []

    # Call the simulation run file
    filename = f"runs/{length}/{name}_simulation_runs2.h5"

    with h5py.File(filename, "r") as hf:

        # List all groups
        print("Keys: %s" % hf.keys())

        for i in range(simulation_runs):
            a_group_key = list(hf.keys())[i]
            data = list(hf[a_group_key])

            for element in data:
                result.append(element)

    return result

The data structure of result looks something like this:

[array([1.9, 1.7]), array([1.4, 1.9]), array([1.6, 1.5]), ...]

First Attempt to Compute Means

I tried the following code to compute a mean score for the first element (each array consists of two elements since there are two players in the simulation):

mean_result = [np.mean(k) for k in zip(*list(result))]

However, this computes the average of each array position across the whole list, i.e. across all episodes of all simulations concatenated, since I appended every dataset to the same list. My goal, however, is to compute an average/standard deviation per episode across the 100 datasets defined above (i.e. each value is the average/standard deviation across all 100 datasets).
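
To make the mismatch concrete, here is a minimal sketch (made-up numbers, 2 simulations of 3 episodes each) of what the comprehension above actually computes:

import numpy as np

# Hypothetical result list: 2 simulations x 3 episodes, concatenated in order
result = [np.array([1.9, 1.7]), np.array([1.4, 1.9]), np.array([1.6, 1.5]),
          np.array([2.0, 1.6]), np.array([1.5, 1.8]), np.array([1.7, 1.4])]

# zip(*result) transposes the whole list, so each player's mean is taken
# over all 6 rows (every episode of every simulation), not per episode
print([np.mean(k) for k in zip(*result)])  # [1.683..., 1.65] -- only 2 numbers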

Is there any way to efficiently accomplish this?


Solution

  • This calculates mean and standard deviation of episode/player values across multiple datasets in 1 file. I think it's what you want to do. If not, I can modify as needed. (Note: I created a small pseudo-data HDF5 file to replicate what you describe. For completeness, that code is at the end of this post.)

    Outline of the steps in the procedure (after opening the file):

    1. Get basic size info from the file: dataset count and number of dataset rows.
    2. Use the values above to size arrays for player 1 and 2 values (variables p1_arr and p2_arr). shape[0] is the episode (row) count, and shape[1] is the simulation (dataset) count.
    3. Loop over all datasets. I used hf.keys() (which iterates over the dataset names). You could also iterate over the names in the list ds_names created earlier. (I created it to simplify the size calculations in step 2.) The enumerate() counter i is used to index the episode values for each simulation to the correct column in each player array.
    4. To get the mean and standard deviation for each row, use the np.mean() and np.std() functions with the axis=1 parameter. That calculates the mean across each row of simulation results.
    5. Next, load the data into the result datasets. I created 2 datasets (same data, different dtypes) as described below:
      a. The 'final_data' dataset is a simple float array of shape=(# of episodes,4), where you need to know what value is in each column. (I suggest adding an attribute to document the column order.)
      b. The 'final_data_named' dataset uses a NumPy structured dtype so you can name the fields (columns). It has shape=(# of episodes,). You access each column by name.

    A note on statistics: calculations are sensitive to the sum() operator's behavior over the range of values. If your data is well defined, the NumPy functions are appropriate. I investigated this a few years ago. See this discussion for all the details: when to use numpy vs statistics modules
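
    As a small illustration of the difference (my own example, not from that discussion): Python's built-in sum() accumulates rounding error term by term, statistics.mean() uses exact rational arithmetic internally, and np.mean() uses pairwise summation, which is fast and accurate for well-behaved data.

    import statistics
    import numpy as np

    vals = [0.1] * 1_000_000
    print(sum(vals) / len(vals))  # slightly off 0.1 (error accumulates term by term)
    print(statistics.mean(vals))  # 0.1 (exact, but much slower)
    print(np.mean(vals))          # ~0.1 (pairwise summation, fast)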

    Code to read and calculate statistics below.

    import h5py
    import numpy as np
    
    def ExtractSimData(name, simulation_runs, length):
    
        # Call the simulation run file
        filename = f"runs/{length}/{name}simulation_runs2.h5"
        with h5py.File(filename, "a") as hf:
            # List all dataset names
            ds_names = list(hf.keys())
            print(f'Dataset names (keys): {ds_names}')
    
            # Create empty arrays for player1 and player2 episode values
            sim_cnt = len(ds_names)
            print(f'# of simulation runs (dataset count) = {sim_cnt}')
            ep_cnt = hf[ds_names[0]].shape[0]
            print(f'# of episodes (rows) in each dataset = {ep_cnt}')
            p1_arr = np.empty((ep_cnt,sim_cnt))
            p2_arr = np.empty((ep_cnt,sim_cnt))
            
            for i, ds in enumerate(hf.keys()): # each dataset is 1 simulation               
                p1_arr[:,i] = hf[ds][:,0]
                p2_arr[:,i] = hf[ds][:,1]
                    
            ds1 = hf.create_dataset('final_data', shape=(ep_cnt,4), 
                              compression='gzip', chunks=True)   
            ds1[:,0] = np.mean(p1_arr, axis=1)
            ds1[:,1] = np.std(p1_arr, axis=1)
            ds1[:,2] = np.mean(p2_arr, axis=1)
            ds1[:,3] = np.std(p2_arr, axis=1)        
    
            dt = np.dtype([ ('average_player1',float), ('average_player2',float), 
                            ('std_player1',float), ('std_player2',float) ] )
            ds2 = hf.create_dataset('final_data_named', shape=(ep_cnt,), dtype=dt, 
                              compression='gzip', chunks=True)   
            ds2['average_player1'] = np.mean(p1_arr, axis=1)
            ds2['std_player1'] = np.std(p1_arr, axis=1)
            ds2['average_player2'] = np.mean(p2_arr, axis=1)
            ds2['std_player2'] = np.std(p2_arr, axis=1)        
    
    ### main ###
    simulation_runs = 10
    length='01'
    name='test_'
    ExtractSimData(name, simulation_runs, length)  
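
    To verify the output, you can read the named dataset back and access each column by field name (a quick sanity check, using the same file and dataset names as above):

    import h5py

    with h5py.File('runs/01/test_simulation_runs2.h5', 'r') as hf:
        # Access columns of the compound dataset by field name
        print(hf['final_data_named']['average_player1'][:5])
        print(hf['final_data_named']['std_player1'][:5])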
    

    Code to create pseudo-data HDF5 file below.

    import h5py
    import numpy as np
    
    # Create some pseudo-test data
    def test_simulation(i):
        players = 2
        periods = 1000
    
        # Define the simulation with some random data
        val_hist = np.random.random(periods*players).reshape(periods,players)    
        
        # Overwrite the file on the first iteration, then append
        if i == 0:
            mode='w'
        else:
            mode='a'
        # Save simulation data (unique datasets)
        with h5py.File('runs/01/test_simulation_runs2.h5', mode) as hf:
            hf.create_dataset(f'data_{i:03}', data=val_hist, 
                              compression='gzip', chunks=True)
    
    # Run the simulation N times
    simulations = 10
    for i in range(simulations):
        print(f'--- Iteration {i} ---')
        test_simulation(i)