Tags: python, pandas, dataframe, optimization, python-itertools

Optimizing DataFrame Processing for User Login Intervals with Pandas


I'm working on a function that processes the login times of users on a website. These times are stored in a Pandas DataFrame whose first column indicates a time interval, while each remaining column indicates whether a given user was logged in during that interval. My program takes this DataFrame and groups users, creating one column for every possible combination of users, and then checks whether those users were connected during the time interval defined by the row.

For example, in a case with 3 users, A, B, and C, if A and B are logged in at a specific row, the column A,B will have a 1 while the columns for A and B will be 0. If all three users are active at the same time, the column A,B,C will have a 1, and the rest will be 0.

In my real case there are many columns, so the exponential cost of the function makes it prohibitive. I have been trying to write code that finds groups of users that never coincide, so that redundant columns are not constructed. For instance, if B and C never have a row with a value of 1 in common, it makes no sense to generate the columns B,C or A,B,C.
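
To make the idea concrete, here is a sketch of the pairwise check I have in mind (untested; it assumes the 0/1 layout and the 'fecha_hora' time column from my code below):

user_cols = [col for col in df.columns if col != 'fecha_hora']
co = df[user_cols].T @ df[user_cols]  # co.loc[u, v] counts rows where u and v are both 1
never_together = [
    (u, v)
    for i, u in enumerate(user_cols)
    for v in user_cols[i + 1:]
    if co.loc[u, v] == 0
]
# any combination containing one of these pairs could be skipped entirely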

I tried using GitHub Copilot, but it hasn't been able to provide a useful solution. Could someone help me optimize my code?

Here is the code I'm using:

from itertools import combinations

def process_dataframe_opt2(df):
    # List of columns for combinations, excluding 'fecha_hora'
    columns = [col for col in df.columns if col != 'fecha_hora']
    
    # Generate all possible combinations of the columns
    for r in range(1, len(columns) + 1):
        for comb in combinations(columns, r):
            col_name = ','.join(comb)
            df[col_name] = df[list(comb)].all(axis=1).astype(int)
    
    # Create a copy of the original DataFrame to modify it
    df_copy = df.copy()
    
    # Process combinations from largest to smallest
    for r in range(len(columns), 1, -1):
        for comb in combinations(columns, r):
            col_name = ','.join(comb)
            active_rows = df[col_name] == 1
            if active_rows.any():
                for sub_comb in combinations(comb, r-1):
                    sub_col_name = ','.join(sub_comb)
                    df_copy.loc[active_rows, sub_col_name] = 0
    
    # Remove columns that only contain 0
    df_copy = df_copy.loc[:, (df_copy != 0).any(axis=0)]
    
    return df_copy
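
To quantify the cost: the combination-generation loop materializes one column per non-empty subset of users, i.e. 2**n - 1 columns for n users. A quick check (n = 26 is a hypothetical user count):

from math import comb

n = 26  # hypothetical number of users
total = sum(comb(n, r) for r in range(1, n + 1))
print(total)  # 67108863 columns, i.e. 2**26 - 1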

And here is the code to generate an example:

import pandas as pd
from itertools import combinations
# Create a range of time
date_rng = pd.date_range(start='2024-05-13 15:52:00', end='2024-05-13 16:04:00', freq='min')

# Create a DataFrame with the time column
df = pd.DataFrame(date_rng, columns=['fecha_hora'])

# Add the login columns with corresponding values
df['A'] = 1  # Always active
df['B'] = [1] * 6 + [0] * 5 + [1] * 2  # Active in the first 6 intervals and the last 2
df['C'] = [1] * 5 + [0] * 6 + [1] * 2  # Active in the first 5 intervals and the last 2
df['D'] = [1] * 4 + [0] * 7 + [1] * 2  # Active in the first 4 intervals and the last 2
df['E'] = [1] * 3 + [0] * 3 + [1] * 2 + [0] * 3 + [1] * 2  # Active in three separate blocks
df['F'] = [0] * 7 + [1] * 3 + [0] * 3  # Active in a single block towards the end
df['alfa'] = [0] * 10 + [1] * 1 + [0] * 2  # Active in a single interval

# Adjust some rows to have more than one '1'
df.loc[1, ['A', 'B', 'C']] = 1  # Row with multiple '1's
df.loc[8, ['D', 'E', 'F']] = 1  # Another row with multiple '1's

df_copy = process_dataframe_opt2(df)

Could anyone provide insights or suggestions on how to optimize this function to avoid the exponential cost and improve performance?


Solution

  • Each record of the expected result contains a single field equal to 1, located in the column named after the active users, while the rest of the fields are 0 (here the time column is assumed to be the index of the data frame, as in the experiment code below). So here's what we can do to produce this table:

    • replace each record by the sequence of active users;
    • use pivot, unstack, get_dummies, etc. to convert the obtained series to a data frame.
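
    To see step 1 in isolation: applied to the experiment data further below, it turns each record into a label listing the active users (a sketch; the timestamps shown are taken from that data):

    active = df.astype(bool).apply(lambda x: ','.join(df.columns[x]), axis=1)
    # 2024-05-13 15:52:00    A,B,C,D,E
    # 2024-05-13 15:55:00    A,B,C,D
    # 2024-05-13 15:57:00    A,B
    # ...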

    Example with get_dummies:

    result = pd.get_dummies(
        df.astype(bool).apply(lambda x: ','.join(df.columns[x]), axis=1),
        dtype=int
    )
    

    Example with unstack:

    result = (
        df.astype(bool)
        .apply(lambda x: ','.join(df.columns[x]), axis=1)   # select active users
        .to_frame('users')                     # name the label column
        .set_index('users', append=True)       # push users to the second index level
        .assign(mark=1)                        # mark records before pivoting
        .squeeze()                             # back to a Series of marks
        .unstack(fill_value=0)                 # pivot the users level into columns
    )
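
    pd.crosstab belongs to the same pivoting family and can serve as a third variant (a sketch, not included in the experiments below):

    active = df.astype(bool).apply(lambda x: ','.join(df.columns[x]), axis=1)
    result = pd.crosstab(active.index, active)  # each row contributes once, so counts are 0/1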
    

    Code for experiments:

    import pandas as pd
    
    data = { 
        'A': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        'B': [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1],
        'C': [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1],
        'D': [1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1],
        'E': [1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1],
        'F': [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0],
        'alfa': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
    }
    index = pd.date_range(
        start='2024-05-13 15:52:00', 
        periods=len(data['A']), freq='min'
    )
    df = pd.DataFrame(data, index)
    
    # option with get_dummies
    result1 = pd.get_dummies(
        df.astype(bool).apply(lambda x: ','.join(df.columns[x]), axis=1),
        dtype=int
    )
    
    # option with unstacking
    result2 = (
        df.astype(bool)
        .apply(lambda x: ','.join(df.columns[x]), axis=1)
        .to_frame('users')                 # name the label column
        .set_index('users', append=True)   # push labels to a second index level
        .assign(mark=1).squeeze()          # mark records, back to a Series
        .unstack(fill_value=0)
    )
    
    assert result1.equals(result2)
    
    >>> print(result1)
                         A,B  A,B,C  A,B,C,D  A,B,C,D,E  A,D,E,F  A,E  A,E,F  A,F  A,alfa
    2024-05-13 15:52:00    0      0        0          1        0    0      0    0       0
    2024-05-13 15:53:00    0      0        0          1        0    0      0    0       0
    2024-05-13 15:54:00    0      0        0          1        0    0      0    0       0
    2024-05-13 15:55:00    0      0        1          0        0    0      0    0       0
    2024-05-13 15:56:00    0      1        0          0        0    0      0    0       0
    2024-05-13 15:57:00    1      0        0          0        0    0      0    0       0
    2024-05-13 15:58:00    0      0        0          0        0    1      0    0       0
    2024-05-13 15:59:00    0      0        0          0        0    0      1    0       0
    2024-05-13 16:00:00    0      0        0          0        1    0      0    0       0
    2024-05-13 16:01:00    0      0        0          0        0    0      0    1       0
    2024-05-13 16:02:00    0      0        0          0        0    0      0    0       1
    2024-05-13 16:03:00    0      0        0          1        0    0      0    0       0
    2024-05-13 16:04:00    0      0        0          1        0    0      0    0       0
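
    A note on cost: the number of output columns equals the number of distinct active-user sets that actually occur in the data, never the 2**n combination count, so the work stays proportional to the size of the input table:

    labels = df.astype(bool).apply(lambda x: ','.join(df.columns[x]), axis=1)
    assert result1.shape[1] == labels.nunique()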