I am currently running 100 simulations that computes 1M values per simulation (i.e. per episode/iteration there is one value).
My main
file looks like this:
# Defining the test simulation environment
def test_simulation:
environment = environment(
periods = 1000000
parameter_x = ...
parameter_y = ...
)
# Defining the simulation
environment.simulation()
# Run the simulation 100 times
for i in range(100):
print(f'--- Iteration {i} ---')
test_simulation()
The simulation procedure is as follows: Within game()
I generate a value_history
that is continuously appended:
def simulation:
for episode in range(periods):
value = doSomething()
self.value_history.append(value)
Hence, as a result, for each episode/iteration, I compute one value
that is an array, e.g. [1.4 1.9]
(player 1 having 1.4
and player 2 having 1.9
in the current episode/iteration).
To store the data, I use the approach proposed in Append simulation data using HDF5, which works perfectly fine.
After running the simulations, I receive the following Keys
structure:
Keys: <KeysViewHDF5 ['data_000', 'data_001', 'data_002', ..., 'data_100']>
Now, the goal is to compute averages and standard deviations for each value in the 100 data
files that I run, which means that, in the end, I would have a final_data
set consisting of 1M averages and 1M standard deviations (one average and one standard deviation for each row (for each player) across the 100 simulations).
The goal would thus be to get something like the the following structure [average_player1, average_player2], [std_player1, std_player2]
:
episode == 1: [1.5, 1.5], [0.1, 0.2]
episode == 2: [1.4, 1.6], [0.2, 0.3]
...
episode == 1000000: [1.7, 1.6], [0.1, 0.3]
I currently use the following code to extract the data storing it into an empty list:
def ExtractSimData(name, simulation_runs, length):
# Create empty list
result = []
# Call the simulation run file
filename = f"runs/{length}/{name}_simulation_runs2.h5"
with h5py.File(filename, "r") as hf:
# List all groups
print("Keys: %s" % hf.keys())
for i in range(simulation_runs):
a_group_key = list(hf.keys())[i]
data = list(hf[a_group_key])
for element in data:
result.append(element)
The data structure of result
looks something like this:
[array([1.9, 1.7]), array([1.4, 1.9]), array([1.6, 1.5]), ...]
I tried to use the following code to come up with a mean score for the first element (the array consists of two elements since there are two players in the simulation):
mean_result = [np.mean(k) for k in zip(*list(result))]
However, this computes the average of each element in the array across the whole list since I appended each data
set to the empty list. My goal, however, would be to compute an average/standard deviation across the 100 data
sets defined above (i.e. one value is the average/standard deviation across all 100 data sets).
Is there any way to efficiently accomplish this?
This calculates mean and standard deviation of episode/player values across multiple datasets in 1 file. I think it's what you want to do. If not, I can modify as needed. (Note: I created a small pseudo-data HDF5 file to replicate what you describe. For completeness, that code is at the end of this post.)
Outline of steps in the procedure summarized below (after opening the file):
p1_arr
and p2_arr
). shape[0]
is the episode (row) count, and shape[1]
is the simulation (dataset) count.hf.keys()
(which iterates over the dataset names). You could also iterate over names in list ds_names
created earlier. (I created it to simplify size calculations in step 2). The enumerate()
counter i
is used to index episode values for each simulation to the correct column in each player array.np.mean()
and np.std()
functions with the axis=1
parameter. That calculates the mean across each row of simulation results.'final_data'
dataset is a simple float array of shape=(# of episodes,4)
, where you need to know what value is in each column. (I suggest adding an attribute to document.)'final_data_named'
dataset uses a NumPy recarray so you can name the fields(columns). It has shape=(# of episodes,)
. You access each column by name.A note on statistics: calculations are sensitive to the sum()
operator's behavior over the range of values. If your data is well defined, the NumPy functions are appropriate. I investigated this a few years ago. See this discussion for all the details: when to use numpy vs statistics modules
Code to read and calculate statistics below.
import h5py
import numpy as np
def ExtractSimData(name, simulation_runs, length):
# Call the simulation run file
filename = f"runs/{length}/{name}simulation_runs2.h5"
with h5py.File(filename, "a") as hf:
# List all dataset names
ds_names = list(hf.keys())
print(f'Dataset names (keys): {ds_names}')
# Create empty arrays for player1 and player2 episode values
sim_cnt = len(ds_names)
print(f'# of simulation runs (dataset count) = {sim_cnt}')
ep_cnt = hf[ ds_names[0] ].shape[0]
print(f'# of episodes (rows) in each dataset = {ep_cnt}')
p1_arr = np.empty((ep_cnt,sim_cnt))
p2_arr = np.empty((ep_cnt,sim_cnt))
for i, ds in enumerate(hf.keys()): # each dataset is 1 simulation
p1_arr[:,i] = hf[ds][:,0]
p2_arr[:,i] = hf[ds][:,1]
ds1 = hf.create_dataset('final_data', shape=(ep_cnt,4),
compression='gzip', chunks=True)
ds1[:,0] = np.mean(p1_arr, axis=1)
ds1[:,1] = np.std(p1_arr, axis=1)
ds1[:,2] = np.mean(p2_arr, axis=1)
ds1[:,3] = np.std(p2_arr, axis=1)
dt = np.dtype([ ('average_player1',float), ('average_player2',float),
('std_player1',float), ('std_player2',float) ] )
ds2 = hf.create_dataset('final_data_named', shape=(ep_cnt,), dtype=dt,
compression='gzip', chunks=True)
ds2['average_player1'] = np.mean(p1_arr, axis=1)
ds2['std_player1'] = np.std(p1_arr, axis=1)
ds2['average_player2'] = np.mean(p2_arr, axis=1)
ds2['std_player2'] = np.std(p2_arr, axis=1)
### main ###
simulation_runs = 10
length='01'
name='test_'
ExtractSimData(name, simulation_runs, length)
Code to create pseudo-data HDF5 file below.
import h5py
import numpy as np
# Create some psuedo-test data
def test_simulation(i):
players = 2
periods = 1000
# Define the simulation with some random data
val_hist = np.random.random(periods*players).reshape(periods,players)
if i == 0:
mode='w'
else:
mode='a'
# Save simulation data (unique datasets)
with h5py.File('runs/01/test_simulation_runs2.h5', mode) as hf:
hf.create_dataset(f'data_{i:03}', data=val_hist,
compression='gzip', chunks=True)
# Run the simulation N times
simulations = 10
for i in range(simulations):
print(f'--- Iteration {i} ---')
test_simulation(i)