CONTEXT:
I have a DataFrame and a function that duplicates each row according to the number in its "count" column. My current method is very slow on larger datasets:
def replicate_row(df):
    for i in range(len(df)):
        row = df.iloc[i]
        if row['count']>0:
            rep = int(row['count'])-1
            if rep != 0:
                full_df = full_df.append([row]*rep, ignore_index=True)
I'm trying to vectorize this function so it runs faster, and this is what I have so far:
def vector_function(
        pandas_series: pd.Series) -> pd.Series:
    scaled_series = pandas_series['count'] - 1
    *** vectorized replication code here ? ***
    return scaled_series
SAMPLE DATA
Name  Age  Gender  Count
Jen   25   F       3
Paul  30   M       2
The expected outcome of DF would be:
Name  Age  Gender
Jen   25   F
Jen   25   F
Jen   25   F
Paul  30   M
Paul  30   M
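Try using pd.Index.repeat, which repeats each index label as many times as the matching value in Count; selecting those labels with .loc then duplicates the rows: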
df = df.loc[df.index.repeat(df['Count'])].reset_index(drop=True).drop('Count', axis=1)
Output:
>>> df
   Name  Age Gender
0   Jen   25      F
1   Jen   25      F
2   Jen   25      F
3  Paul   30      M
4  Paul   30      M
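
For anyone who wants to run this end to end, here is a minimal self-contained sketch built on the sample data from the question. The helper name replicate_rows and its count_col parameter are made up for illustration, and the sketch assumes the DataFrame has a unique index and non-negative integer counts.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Name": ["Jen", "Paul"],
    "Age": [25, 30],
    "Gender": ["F", "M"],
    "Count": [3, 2],
})


def replicate_rows(frame: pd.DataFrame, count_col: str = "Count") -> pd.DataFrame:
    """Hypothetical helper: repeat each row `count_col` times, then drop that column.

    Assumes a unique index and non-negative integer counts.
    """
    # Repeat every index label as many times as its count says,
    # then pull the matching rows with .loc (one row per repeated label).
    repeated = frame.loc[frame.index.repeat(frame[count_col])]
    # Drop the helper column and renumber the rows 0..n-1.
    return repeated.drop(columns=count_col).reset_index(drop=True)


result = replicate_rows(df)
print(result)
```

One behavioural difference to keep in mind: with repeat, a row whose Count is 0 disappears from the result, so filter or adjust the counts first if such rows need to be kept.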