Tags: python, pandas, numpy, vector, encapsulation

Encapsulating Vectorised Functions - For Use With Pandas DataFrames


I've been refactoring some code and using it to explore how to structure maintainable, flexible, concise code when using Pandas and NumPy. (Usually I only use them briefly, but I'm now in a role where I should be aiming to become an ex-spurt.)

One example I came across is a function that can sometimes be called on one column of values, and sometimes called on three columns of values. Vectorised code using Numpy encapsulated it wonderfully. But using it becomes a bit clunky.

How should I "better" write the following function?

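import numpy as np

def project_unit_space_to_index_space(v, vertices_per_edge):
    # Element-wise, so v may be one column (1-D) or several stacked columns (2-D)
    return np.rint((v + 1) / 2 * (vertices_per_edge - 1)).astype(int)


# Stack the three columns into a single (3, n) array
# (named `inputs` to avoid shadowing the built-in `input`)
inputs = np.concatenate(([df['x']], [df['y']], [df['z']]), axis=0)

index_space = project_unit_space_to_index_space(inputs, 42)

magic_space = some_other_transformation_code(index_space, foo, bar)

# Unpack the (3, n) result row by row into new columns
df['x_'], df['y_'], df['z_'] = magic_space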

As written, the function can accept one column of data or many columns of data, and it still works correctly and quickly.
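A minimal sketch of that claim, using a small made-up DataFrame (the sample values are an assumption chosen for illustration, not data from the original code). Because the arithmetic is element-wise, the same function handles a 1-D array and a stacked 2-D array:

```python
import numpy as np
import pandas as pd


def project_unit_space_to_index_space(v, vertices_per_edge):
    # Same function as in the question: element-wise, so shape does not matter
    return np.rint((v + 1) / 2 * (vertices_per_edge - 1)).astype(int)


# Hypothetical sample data, assumed to lie in [-1, 1]
df = pd.DataFrame({
    "x": [-1.0, 0.0, 1.0],
    "y": [1.0, -1.0, 0.5],
    "z": [0.0, 0.0, -0.5],
})

# One column in, one 1-D array of indices out
one_col = project_unit_space_to_index_space(df["x"].to_numpy(), 42)

# Three columns stacked as a (3, n) array in, a (3, n) array out
three_cols = project_unit_space_to_index_space(df[["x", "y", "z"]].to_numpy().T, 42)

print(one_col.shape, three_cols.shape)
```

Note that `np.rint` rounds halves to the nearest even integer, so a value of exactly 20.5 becomes 20 rather than 21.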

The return type is the right shape to be passed directly into another similarly structured function, allowing me to chain functions neatly.

Even assigning the results back to new columns in a dataframe isn't "awful", though it is a little clunky.

But packaging up the inputs as a single np.ndarray is very very clunky indeed.


I haven't found any style guides that cover this. They're all over iterrows and lambda expressions, etc. But I found nothing on best practices for encapsulating such logic.


So, how would you structure the above code?


EDIT: Timings of various options for collating the inputs

%timeit test = project_unit_sphere_to_unit_cube(df[['x','y','z']].unstack().to_numpy())                      
# 1.44 ms ± 57.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit test = project_unit_sphere_to_unit_cube(df[['x','y','z']].to_numpy().T)                              
# 558 µs ± 6.25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit test = project_unit_sphere_to_unit_cube(df[['x','y','z']].transpose().to_numpy())                    
# 817 µs ± 18.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit test = project_unit_sphere_to_unit_cube(np.concatenate(([df['x']], [df['y']], [df['z']]), axis=0))   
# 3.46 ms ± 42.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
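The four collation options timed above do not all produce the same shape, which may matter more than the timing. A small sketch, using a made-up two-row DataFrame (an assumption for illustration), that compares what each one returns:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, not taken from the original post
df = pd.DataFrame({"x": [0.0, 0.1], "y": [0.2, 0.3], "z": [0.4, 0.5]})
cols = df[["x", "y", "z"]]

# The original approach: wrap each column in a list and concatenate
stacked = np.concatenate(([df["x"]], [df["y"]], [df["z"]]), axis=0)

# Convert to an array, then transpose the array
via_T = cols.to_numpy().T

# Transpose the DataFrame, then convert to an array
via_transpose = cols.transpose().to_numpy()

# unstack() on a single-level index returns a flat Series, so this is 1-D
via_unstack = cols.unstack().to_numpy()

print(stacked.shape, via_T.shape, via_transpose.shape, via_unstack.shape)
```

The first three give a `(3, n)` array that can be unpacked into three columns. The `unstack()` variant gives a flat array of length `3 * n`; an element-wise function still runs on it, but the result would need reshaping before being assigned back to `x_`, `y_`, `z_`.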

Solution

  • In [101]: df = pd.DataFrame(np.arange(12).reshape(4,3))                         
    In [102]: df                                                                    
    Out[102]: 
       0   1   2
    0  0   1   2
    1  3   4   5
    2  6   7   8
    3  9  10  11
    

    You are making an (n, m) array from n columns of the dataframe, each with m rows:

    In [103]: np.concatenate([[df[0]],[df[1]],[df[2]]],0)                           
    Out[103]: 
    array([[ 0,  3,  6,  9],
           [ 1,  4,  7, 10],
           [ 2,  5,  8, 11]])
    

    A more compact way to do this is to transpose the array of those columns:

    In [104]: df.to_numpy().T                                                       
    Out[104]: 
    array([[ 0,  3,  6,  9],
           [ 1,  4,  7, 10],
           [ 2,  5,  8, 11]])
    

    The dataframe has its own transpose:

    In [109]: df.transpose().to_numpy()                                             
    Out[109]: 
    array([[ 0,  3,  6,  9],
           [ 1,  4,  7, 10],
           [ 2,  5,  8, 11]])
    

    Your calculation works with a dataframe, returning a dataframe with the same shape and indices:

    In [113]: np.rint((df+1)/2 *(42-1)).astype(int)                                 
    Out[113]: 
         0    1    2
    0   20   41   62
    1   82  102  123
    2  144  164  184
    3  205  226  246
    

    Some numpy functions convert the inputs to numpy arrays and return an array. Others, by delegating details to pandas methods, can work directly on the dataframe, and return a dataframe.
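One way this could be applied to the question's code is to pass the selected columns straight to the function and join the result back, skipping the manual packing and unpacking. This is only a sketch under assumptions: the sample data is made up, and the `add_suffix("_")` plus `join` step is one possible way to write the new columns, not something the original post prescribes.

```python
import numpy as np
import pandas as pd


def project_unit_space_to_index_space(v, vertices_per_edge):
    # Unchanged from the question; np.rint is a ufunc, so a DataFrame
    # passed in comes back as a DataFrame with the same index and columns
    return np.rint((v + 1) / 2 * (vertices_per_edge - 1)).astype(int)


# Hypothetical sample data, assumed to lie in [-1, 1]
df = pd.DataFrame({
    "x": [-1.0, 0.0, 1.0],
    "y": [1.0, -1.0, 0.5],
    "z": [0.0, 0.0, -0.5],
})

# Pass the DataFrame slice directly; no concatenate or transpose needed
projected = project_unit_space_to_index_space(df[["x", "y", "z"]], 42)

# Rename x, y, z to x_, y_, z_ and attach them alongside the originals
result = df.join(projected.add_suffix("_"))

print(result)
```

This only stays in DataFrame form while every function in the chain is built from operations that pandas supports directly. If a later step converts to an array, falling back to `df[['x','y','z']].to_numpy().T` at the start of the chain keeps the shapes explicit.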