Search code examples
pythonpandasdataframenumpyvectorization

Manipulation of a Pandas dataframe most time- and memory-efficiently


Please imagine I have a dataframe like this:

df = pd.DataFrame(index=pd.Index(['1', '1', '2', '2'], name='from'), columns=['to'], data= ['2', '2', '4', '5'])

df:

enter image description here

Now, I would like to calculate a matrix comprising of the percentage of times each value in the index "from" transitions to each value in column 'to', which is known as a transition matrix. I can achieve this by creating an empty transition matrix first and then populating it with the percentages using a for loop:

#Create an empty matrix to populate later (using sparse dtype to save memory):
matrix = pd.DataFrame(index=df.index.unique(), columns=df.to.unique(), data=0, dtype=pd.SparseDtype(dtype=np.float16, fill_value=0)) 

matrix:

enter image description here

for i in range(len(df)):
    from_, to = df.index[i], df.to.iloc[i]     
    matrix[to] = matrix[to].sparse.to_dense() # Convert to dense format because sparse dtype does not allow value assignment with .loc in the next line:  
    matrix.loc[from_, to] += 1     # Do a normal insertion with .loc[]
    matrix[to] = matrix[to].astype(pd.SparseDtype(dtype=np.float16, fill_value=0)) # Back to the original sparse format

matrix = (matrix.div(matrix.sum(axis=1), axis=0)*100) # converting counts to percentages

matrix:

enter image description here

This works. For example, index "1" only transitioned to "2" (100% of the time) and index "2" transitioned to "4" 50% of the time and to "5" the other 50% of the time, as can be verified in df.

Issue: The actual matrix is about 500K by 500K and the for loop takes a really long time to finish. So, is there a vectorized or other efficient way of calculating matrix from df

Note: I would get MemoryError without using the whole Sparse dtype thing even with dtype=float16 in pd.DataFrame() so I prefer to keep that if possible. It would be great if the 500K by 500K matrix will not take up more than 10-12Gb of RAM. Also, if it matters, these percentages will always have a 0-100 range, obviously.


Solution

  • Option 1: pd.crosstab

    out = (pd.crosstab(index=df.index, 
                       columns=df['to'], 
                       normalize='index'
                       )
           .mul(100)
           .rename_axis(index='from', columns=None)
           )
    

    Output:

              2     4     5
    from                   
    1     100.0   0.0   0.0
    2       0.0  50.0  50.0
    

    Option 2: df.pivot_table

    out2 = (df.reset_index()
            .pivot_table(index='from', columns='to', 
                         values='to', aggfunc='size', fill_value=0)
            .div(df.index.value_counts(), axis=0)
            .mul(100)
            .rename_axis(columns=None)
            )
    
    out2.equals(out)
    # True
    

    Option 3: df.groupby

    out3 = (df.groupby(df.index)
            .value_counts(normalize=True)
            .unstack(-1, fill_value=0)
            .mul(100)
            .rename_axis(columns=None)
            )
    
    out3.equals(out)
    # True
    

    Edit:

    With a very sizeable df, I would consider creating a csr_matrix to avoid the MemoryError. In this case, we only use df.groupby + value_counts and use Index.factorize to pass compatible row_ind and col_ind values.

    from scipy.sparse import csr_matrix
    
    g = df.groupby(df.index).value_counts(normalize=True).mul(100)
    
    idx_codes, idx_unique = g.index.get_level_values(0).factorize()
    col_codes, col_unique = g.index.get_level_values(1).factorize() 
    
    m = csr_matrix((g.values, (idx_codes, col_codes)))
    

    If you want, you can turn that back into a df with df.sparse.from_spmatrix:

    df_m = pd.DataFrame.sparse.from_spmatrix(m, 
                                             index=idx_unique, 
                                             columns=col_unique)
    

    Output:

           2     4     5
    1  100.0   0.0   0.0
    2    0.0  50.0  50.0
    

    But understanding this, you can also continue with m and still use idx_unique and col_unique for slicing. E.g.:

    bool_idx = col_unique.isin(['4','5'])
    
    m[:, bool_idx].sum(axis=1)
    
    matrix([[  0.],
            [100.]])
    

    Compare:

    df_m.loc[:, bool_idx].sum(axis=1)
    
    1        0
    2    100.0
    dtype: Sparse[float64, 0]