Tags: python, apache-spark, group-by, azure-databricks

I want to speed up a for loop in Python and use less memory


Hi, below is the function I am using to calculate the quantiles (25/50/75) and mean for each column.

import pandas as pd

def newsummary(final_per, grp_lvl, col):
  # Build the output column names for this column's three quantiles and mean.
  new_col1 = '_'.join([col] + grp_lvl + ['25%'])
  new_col2 = '_'.join([col] + grp_lvl + ['50%'])
  new_col3 = '_'.join([col] + grp_lvl + ['75%'])
  new_col4 = '_'.join([col] + grp_lvl + ['mean'])
  final_per2 = pd.DataFrame()

  final_per1 = final_per.groupby(grp_lvl)[col].quantile(0.25).reset_index()
  final_per1.rename(columns={col: new_col1}, inplace=True)
  final_per2[new_col1] = final_per1[new_col1].copy()

  final_per1 = final_per.groupby(grp_lvl)[col].quantile(0.5).reset_index()
  final_per1.rename(columns={col: new_col2}, inplace=True)
  final_per2[new_col2] = final_per1[new_col2].copy()

  final_per1 = final_per.groupby(grp_lvl)[col].quantile(0.75).reset_index()
  final_per1.rename(columns={col: new_col3}, inplace=True)
  final_per2[new_col3] = final_per1[new_col3].copy()

  final_per1 = final_per.groupby(grp_lvl)[col].mean().reset_index()
  final_per1.rename(columns={col: new_col4}, inplace=True)
  final_per2[new_col4] = final_per1[new_col4].copy()

  return final_per2

Calling the function:

grp_lvl = ['ZIP_CODE', 'year']

for j in list_col:   # approximately 1400 columns to iterate
  per = newsummary(final_per, grp_lvl, j)

I want to find the quantiles (25/50/75) and mean for each column and retain those statistics in a new dataframe. I have to do this for around 1400 columns.

The dataframes are pandas DataFrames.

While executing this loop, the .copy() calls cause performance issues. Are there alternate ways to reduce the performance problems and avoid an out-of-memory error? Your help and suggestions are appreciated.

**Note:** I am using an Azure Databricks cluster to execute this.


Solution

  • Since you are creating 4 new columns for each column, i.e., the 0.25, 0.5, and 0.75 quantiles plus the mean of the grouped data, sticking with a pandas DataFrame the way you already are is likely the better choice.

    • PySpark grouped data requires an aggregate function, and there is no built-in aggregate for exact quantiles (an approximate alternative via percentile_approx is sketched after the solution code).

    • There is no need to use .copy() or to return any value from the function. So, modify your code as below:

    import pandas as pd

    final_per2 = pd.DataFrame()   # global frame that collects the new columns

    def newsummary(final_per, grp_lvl, col):
        new_col1 = '_'.join([col] + grp_lvl + ['25%'])
        new_col2 = '_'.join([col] + grp_lvl + ['50%'])
        new_col3 = '_'.join([col] + grp_lvl + ['75%'])
        new_col4 = '_'.join([col] + grp_lvl + ['mean'])

        final_per1 = final_per.groupby(grp_lvl)[col].quantile(0.25).reset_index()
        final_per1.rename(columns={col: new_col1}, inplace=True)
        final_per2[new_col1] = final_per1[new_col1]

        final_per1 = final_per.groupby(grp_lvl)[col].quantile(0.5).reset_index()
        final_per1.rename(columns={col: new_col2}, inplace=True)
        final_per2[new_col2] = final_per1[new_col2]

        final_per1 = final_per.groupby(grp_lvl)[col].quantile(0.75).reset_index()
        final_per1.rename(columns={col: new_col3}, inplace=True)
        final_per2[new_col3] = final_per1[new_col3]

        final_per1 = final_per.groupby(grp_lvl)[col].mean().reset_index()
        final_per1.rename(columns={col: new_col4}, inplace=True)
        final_per2[new_col4] = final_per1[new_col4]


    for j in list_col:   # approximately 1400 columns to iterate
        newsummary(final_per, grp_lvl, j)

    final_per2   # the assembled summary (displayed as notebook cell output)
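
For completeness on the quantile point above: if approximate quantiles were acceptable, Spark does ship percentile_approx as an aggregate function (available in pyspark.sql.functions from Spark 3.1). The following is a minimal sketch, not part of the original answer; the Spark DataFrame name sdf is a hypothetical stand-in for a Spark version of final_per:

    from pyspark.sql import functions as F

    # Hypothetical Spark DataFrame `sdf` holding ZIP_CODE, year and the value columns.
    aggs = []
    for c in list_col:
        aggs += [
            F.percentile_approx(c, 0.25).alias('_'.join([c] + grp_lvl + ['25%'])),  # approximate, not exact
            F.percentile_approx(c, 0.50).alias('_'.join([c] + grp_lvl + ['50%'])),
            F.percentile_approx(c, 0.75).alias('_'.join([c] + grp_lvl + ['75%'])),
            F.mean(c).alias('_'.join([c] + grp_lvl + ['mean'])),
        ]

    summary_sdf = sdf.groupBy(grp_lvl).agg(*aggs)

Note that with ~1400 columns this builds about 5600 aggregate expressions in one plan, and the percentiles are approximate, which is why the pandas route may still be preferable here.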
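
Separately, the per-column Python loop itself is the main cost in the pandas version: each of the ~1400 iterations runs four full groupbys. As an assumption beyond the original answer, here is a minimal sketch that computes all four statistics for every column in one grouped pass, using the question's names:

    import pandas as pd

    # One grouped pass over all value columns instead of 4 groupbys per column.
    grouped = final_per.groupby(grp_lvl)[list_col]
    quants = grouped.quantile([0.25, 0.5, 0.75]).unstack()  # columns: (col, quantile)
    means = grouped.mean()                                  # columns: col

    # Rebuild the same names the loop produced, e.g. 'col_ZIP_CODE_year_25%'.
    suffix = '_'.join(grp_lvl)
    quants.columns = [f"{c}_{suffix}_{int(q * 100)}%" for c, q in quants.columns]
    means.columns = [f"{c}_{suffix}_mean" for c in means.columns]

    final_per2 = pd.concat([quants, means], axis=1).reset_index()

This removes both the .copy() calls and the repeated groupby work, so memory use and runtime should drop substantially.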