I have some large-ish data that I want to load into a pandas DataFrame in blocks. Say I have a collection of participants in a study, each of whom has completed one or more repetitions of a given experiment, and each repetition produces a vector of real-valued measures. My data is initially grouped by participant as a block (2D array) of values, with one row per repetition and one column per measure. I want to consolidate this into a single DataFrame, and to do so relatively efficiently.
It seems that there are some efficiencies to be had when loading these into my DataFrame, but I don't have enough experience with pandas to be sure that I am doing things the right way.
Here is a small example representing my current approach:
import pandas as pd
import numpy as np
# here I create some dummy data
Nmin = 3
Nmax = 7
K = 4
columns = ['name', 'repetition'] + [f'v{k}' for k in range(K)]
names = ['A', 'B', 'C']
Ns = {name: np.random.randint(Nmin, Nmax) for name in names}
repetitions = {name: np.arange(Ns[name]) for name in names}
values = {name: np.random.random((Ns[name], K)) for name in names}
# here I try to load this into a dataframe
df = pd.DataFrame(columns=columns)
for name in names:
    these_reps = repetitions[name]
    these_vals = values[name]
    this_N = these_reps.size
    these_names = [name] * this_N
    this_df = pd.DataFrame(zip(these_names, these_reps, *these_vals.T.tolist()),
                           columns=columns)
    df = pd.concat([df, this_df])
Ignoring the block of code where I create the data and focusing just on the second block: I have a series of participants with names, and I can access the repetition numbers and the measures from somewhere.
My method above loads one participant's data at a time, turns everything into an iterator with the zip call (trying as far as possible not to instantiate intermediate data structures), and uses that iterator to create a DataFrame for just that participant's data. Finally, I concatenate this with the big consolidated DataFrame.
Is this a reasonable approach? Are there alternative approaches or functionality that would solve my problem more efficiently or elegantly? I am particularly suspicious of the tolist call, since it seems to create a large intermediate data structure unnecessarily.
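(For comparison, here is a minimal sketch of a variant I considered, which collects the per-participant frames in a list and calls pd.concat once at the end instead of growing df inside the loop; it reuses names, repetitions, values and K from the snippet above, and I have not measured whether it is actually faster.)

frames = []
for name in names:
    these_reps = repetitions[name]
    these_vals = values[name]
    # build the per-participant frame straight from the N x K block
    this_df = pd.DataFrame(these_vals, columns=[f'v{k}' for k in range(K)])
    this_df.insert(0, 'repetition', these_reps)
    this_df.insert(0, 'name', name)
    frames.append(this_df)
df = pd.concat(frames, ignore_index=True)  # single concatenation at the end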
Thanks in advance.
"In fact, I currently have a dictionary with keys which are the participant ids and whose values are a pair of values (2d array NxK), and targets (1d array N). I also want to create the repetitions as I go."
Assuming this example:
names = {'A': 3, 'B': 3, 'C': 4}
K = 4
inpt = {name: (np.random.random((rep, K)), np.random.random(rep))
        for name, rep in names.items()}
You could concatenate the different arrays in NumPy, then form the Series/DataFrame for each type of column directly and concat them:
arrays, vectors = map(np.concatenate, zip(*inpt.values()))
rep = {k: len(t[1]) for k,t in inpt.items()}
# {'A': 3, 'B': 3, 'C': 4}
s = pd.Series(np.repeat(list(rep), list(rep.values())), name='name')
# ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C']
out = pd.concat([s, s.groupby(s).cumcount().rename('repetition'),
                 pd.DataFrame(arrays).add_prefix('v'),
                 pd.Series(vectors, name='target')], axis=1)
Output:
  name  repetition        v0        v1        v2        v3    target
0    A           0  0.995683  0.842921  0.783789  0.484767  0.722948
1    A           1  0.330375  0.390841  0.846387  0.310450  0.782210
2    A           2  0.647870  0.858490  0.820561  0.816269  0.842072
3    B           0  0.194760  0.637683  0.859817  0.056968  0.210325
4    B           1  0.317601  0.774134  0.769971  0.984046  0.016807
5    B           2  0.116071  0.566341  0.372419  0.349505  0.104499
6    C           0  0.425990  0.959911  0.926921  0.924259  0.616903
7    C           1  0.720187  0.858244  0.749304  0.118748  0.022709
8    C           2  0.895997  0.210606  0.873736  0.516069  0.507695
9    C           3  0.224749  0.943527  0.858289  0.694335  0.170652
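If you would rather derive the repetition counter from the known counts instead of groupby/cumcount, an equivalent sketch (reusing the rep dict from above) is:

repetition = pd.Series(np.concatenate([np.arange(n) for n in rep.values()]),
                       name='repetition')
# [0, 1, 2, 0, 1, 2, 0, 1, 2, 3]

Both produce a per-repetition counter within each participant; the groupby/cumcount version just derives it directly from the name column.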