python pandas dataframe vectorization series

Vectorizing a Function to Replicate Rows with Pandas


CONTEXT:

I have a DataFrame and a function that duplicates each row according to the number in its "Count" column. My current approach is very slow on larger datasets:

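import pandas as pd

def replicate_row(df):
    # Start from a copy so the original rows are kept once
    full_df = df.copy()
    for i in range(len(df)):
        row = df.iloc[i]
        if row['Count'] > 0:
            rep = int(row['Count']) - 1
            if rep != 0:
                # DataFrame.append was removed in pandas 2.0, so use pd.concat
                full_df = pd.concat([full_df, pd.DataFrame([row] * rep)], ignore_index=True)
    return full_df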

I'm trying to vectorize this function so it runs faster. This is what I have so far:

def vector_function(pandas_series: pd.Series) -> pd.Series:
    scaled_series = pandas_series['Count'] - 1
    # ... vectorized replication code here?
    return scaled_series

SAMPLE DATA

Name    Age    Gender    Count
Jen     25     F         3
Paul    30     M         2

The expected output DataFrame would be:

Name    Age    Gender    
Jen     25     F         
Jen     25     F         
Jen     25     F         
Paul    30     M         
Paul    30     M         

Solution

  • Try using pd.Index.repeat:

    df = df.loc[df.index.repeat(df['Count'])].reset_index(drop=True).drop('Count', axis=1)
    

    Output:

    >>> df
       Name  Age Gender
    0   Jen   25      F
    1   Jen   25      F
    2   Jen   25      F
    3  Paul   30      M
    4  Paul   30      M
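
For reference, here is a self-contained sketch of the same idea, run on the sample data from the question. It assumes "Count" holds non-negative integers; if the column is stored as floats, casting it first with astype(int) is probably needed. The variable name result is only illustrative.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Name": ["Jen", "Paul"],
    "Age": [25, 30],
    "Gender": ["F", "M"],
    "Count": [3, 2],
})

# Repeat each index label "Count" times, select those rows,
# then rebuild a clean 0..n-1 index and drop the helper column
result = (
    df.loc[df.index.repeat(df["Count"])]
      .reset_index(drop=True)
      .drop(columns="Count")
)

print(result)
```

Note that a row whose "Count" is 0 is repeated zero times, so it drops out of the result entirely. Filter or adjust the counts beforehand if such rows should be kept.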