Search code examples
pandasdataframeperformanceconcatenation

An appropriate strategy to load blocks to pandas dataframe with repeated elements


I have some large-ish scale data that I want to load to a pandas dataframe in blocks. Let's say that I have a collection of participants in a study each of which has conducted 1 or more repetitions of a given experiment and each experiment produces a vector of real valued measures. My data is initially grouped by participant with a block (2d array) of values 1 row per repetition 1 column per measure. I want to consolidate this into a data frame and to do so in a relatively efficient way.

It seems that there are some efficiencies to be had when loading these into my dataframe but I don't have enough experience with pandas to be sure that I am doing things the right way.

Here is a small example representing my current approach:

import pandas as pd
import numpy as np

# here I create some dummy data
Nmin = 3
Nmax = 7
K= 4
columns = ['name', 'repetition'] + [f'v{k}' for k in range(K)]
names = ['A', 'B', 'C']
Ns = {name: np.random.randint(Nmin, Nmax) for name in names}
repetitions = {name:np.arange(Ns[name]) for name in names}
values = {name:np.random.random((Ns[name],K)) for name in names}


# here I try to load this into a dataframe
df = pd.DataFrame(columns=columns)
for name in names:
    these_reps = repetitions[name]
    these_vals = values[name]
    this_N = these_reps.size
    these_names = [name]*this_N
    this_df = pd.DataFrame(zip(these_names,these_reps,*these_vals.T.tolist()), columns=columns)
    df = pd.concat([df,this_df])

Ignoring the block of code where I create the data and just focusing on the second block. I have a series of participants with names, and I can access the repetition numbers and the measures from somewhere.

My method above then loads one participants data at a time. Then turns everything into a generator with the zip command, trying as best as possible not to instantiate intermediate data-structures. Then I use this generator to create a dataframe for just that participants data. Finally, I concatenate this with the big consolidated dataframe.

Is this a reasonable approach? Are there alternative approaches or functionality that would solve my problem more efficiently/elegantly? I am particularly focused on the tolist call for the dataframe as this seems like it is inefficiently creating a large data-structure unnecessarily.

Thanks in advance.


Solution

  • "In fact, I currently have a dictionary with keys which are the participant ids and whose values are a pair of values (2d array NxK), and targets (1d array N). I also want to create the repetitions as I go."

    Assuming this example:

    names = {'A': 3, 'B': 3, 'C': 4}
    K = 4
    
    inpt = {name: (np.random.random((rep, K)), np.random.random(rep))
            for name, rep in names.items()}
    

    You could concatenate the different arrays in , then form Series/DataFrames for each type of columns directly and concat

    arrays, vectors = map(np.concatenate, zip(*inpt.values()))
    
    rep = {k: len(t[1]) for k,t in inpt.items()}
    # {'A': 3, 'B': 3, 'C': 4}
    
    s = pd.Series(np.repeat(list(rep), list(rep.values())), name='name')
    # ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C']
    
    out = pd.concat([s, s.groupby(s).cumcount().rename('repetition'),
                     pd.DataFrame(arrays).add_prefix('v'),
                     pd.Series(vectors, name='target')], axis=1)
    

    Output:

      name  repetition        v0        v1        v2        v3    target
    0    A           0  0.995683  0.842921  0.783789  0.484767  0.722948
    1    A           1  0.330375  0.390841  0.846387  0.310450  0.782210
    2    A           2  0.647870  0.858490  0.820561  0.816269  0.842072
    3    B           0  0.194760  0.637683  0.859817  0.056968  0.210325
    4    B           1  0.317601  0.774134  0.769971  0.984046  0.016807
    5    B           2  0.116071  0.566341  0.372419  0.349505  0.104499
    6    C           0  0.425990  0.959911  0.926921  0.924259  0.616903
    7    C           1  0.720187  0.858244  0.749304  0.118748  0.022709
    8    C           2  0.895997  0.210606  0.873736  0.516069  0.507695
    9    C           3  0.224749  0.943527  0.858289  0.694335  0.170652