Tags: python, optimization, numpy, pandas, scientific-computing

Efficient algorithm for expanding grouped tabular data


I am looking for an optimized tool in Python to perform an array manipulation task that I find myself doing over and over. If such a tool already exists, for example in NumPy or pandas, I would rather use it than continue relying on my own cythonized for loop.

I have two arrays of the same length, A and B, describing grouped data. The ith entry of A tells me some property of group i; the jth entry of B tells me how many members group j has. A stores floats, B stores ints. For definiteness, if A[5] = 100.4 and B[5] = 7, then group 5 has a mass of 100.4 and contains 7 members.

My goal is to create a new array of floats, C, of length B.sum(), which is the expansion of the above dataset: C[0:B[0]] = A[0], C[B[0]:B[0]+B[1]] = A[1], and so forth. Is there an optimized way to do this in an existing library such as pandas?

My existing solution is to initialize an empty array C and then run a for loop over the elements of A, filling the corresponding slice of C for each group as above (a sketch of this loop follows). I have been writing and compiling the loop in Cython for speed, but this operation is the single biggest bottleneck in my code, and it seems like an awfully common manipulation when working with tabular data, so I'm wondering whether a heavily optimized implementation already exists.
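
For reference, here is a plain NumPy version of the kind of loop I mean (the variable names are just illustrative):

    import numpy as np

    A = np.array([100.4, 98.3, 88.5])  # per-group property (floats)
    B = np.array([7, 3, 10])           # per-group member counts (ints)

    # Fill one slice of C per group, exactly as described above.
    C = np.empty(B.sum())
    start = 0
    for count, value in zip(B, A):
        C[start:start + count] = value
        start += count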


Solution

  • NumPy has np.repeat() for exactly this kind of thing.

    Given two arrays

    import numpy as np

    A = np.array([100.4, 98.3, 88.5])
    B = np.array([7, 3, 10])
    np.repeat(A, B)
    

    will give you

    array([ 100.4,  100.4,  100.4,  100.4,  100.4,  100.4,  100.4,   98.3,
             98.3,   98.3,   88.5,   88.5,   88.5,   88.5,   88.5,   88.5,
             88.5,   88.5,   88.5,   88.5])
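
    Since the question mentions pandas: the same expansion is available there as Series.repeat, which is convenient if the data already lives in a Series or DataFrame. A minimal sketch (names are illustrative):

    import numpy as np
    import pandas as pd

    A = pd.Series([100.4, 98.3, 88.5])  # per-group property
    B = np.array([7, 3, 10])            # per-group member counts

    # Series.repeat expands each element by the matching count;
    # to_numpy() drops the repeated index and returns a flat array.
    C = A.repeat(B).to_numpy()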