
How do I calculate lambda to use scipy.special.boxcox1p function for my entire dataframe of 500 columns?


I have a dataframe where each row contains the total sales of around 500 product categories, so the dataframe has about 500 columns. I am trying to find the category most highly correlated with a column in another dataframe, using the Pearson correlation method. However, the total sales for all categories are highly skewed, with skewness ranging from 10 to 40 across the category columns, so I want to transform the sales data with a Box-Cox transformation. Since the sales data contains zero values, I want to use the boxcox1p function. Can somebody help me calculate lambda for boxcox1p, since it is a mandatory parameter for this function? Also, is this the correct approach for finding the most highly correlated categories?


Solution

  • Assume df is your dataframe with many numeric columns, and that the lambda parameter of the Box-Cox transformation equals 0.25; then:

    from scipy.special import boxcox1p
    # apply boxcox1p with lambda = 0.25 to every column
    df_boxcox = df.apply(lambda x: boxcox1p(x, 0.25))
    
    

    Now transformed values are in df_boxcox.
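    As an aside (not part of the original answer): since boxcox1p(x, lmbda) is the same as boxcox(x + 1, lmbda), a maximum-likelihood lambda can also be estimated with scipy.stats.boxcox on the shifted data and then reused with boxcox1p on the original zero-containing data. A sketch with made-up values:

    ```python
    import numpy as np
    from scipy.stats import boxcox
    from scipy.special import boxcox1p

    # hypothetical skewed sales values, including zero
    x = np.array([0.0, 1.0, 2.0, 5.0, 10.0, 50.0])

    # boxcox1p(x, lmbda) == boxcox(x + 1, lmbda), and x + 1 is strictly
    # positive, so scipy.stats.boxcox can estimate lambda by MLE
    _, lmbda = boxcox(x + 1)

    # reuse the estimated lambda with boxcox1p on the original data
    transformed = boxcox1p(x, lmbda)
    ```

    The shift by 1 is exactly what boxcox1p does internally, which is why the estimated lambda carries over.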

    Unfortunately there is no built-in method that estimates lambda for boxcox1p directly, but we can use PowerTransformer from sklearn.preprocessing instead:

    import pandas as pd
    from sklearn.preprocessing import PowerTransformer
    pt = PowerTransformer(method='yeo-johnson')
    

    Note that method 'yeo-johnson' is used because it works with both positive and negative values. Method 'box-cox' would raise the error: ValueError: The Box-Cox transformation can only be applied to strictly positive data.

    data = pd.DataFrame({'x': [-2, -1, 0, 1, 2, 3, 4, 5]})  # sample data to illustrate
    pt.fit(data)
    print(pt.lambdas_)
    # [0.89691707]
    

    Then apply the calculated lambda:

    print(pt.transform(data))
    

    result:

    [[-1.60758267]
     [-1.09524803]
     [-0.60974999]
     [-0.16141745]
     [ 0.26331586]
     [ 0.67341476]
     [ 1.07296428]
     [ 1.46430326]]
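
    To tie this back to the question of finding the most correlated category, a sketch of the full pipeline under stated assumptions: `sales` and `target` below are made-up stand-ins for the 500-column sales dataframe and the column from the other dataframe. PowerTransformer fits one lambda per column, and since its standardization step is linear it does not change Pearson correlations:

    ```python
    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import PowerTransformer

    rng = np.random.default_rng(0)
    # hypothetical stand-ins: two skewed category columns and a target column
    sales = pd.DataFrame({'cat_a': rng.exponential(10, 200),
                          'cat_b': rng.exponential(2, 200)})
    target = pd.Series(rng.normal(5, 1, 200), name='target')

    # one Yeo-Johnson lambda is estimated per column
    pt = PowerTransformer(method='yeo-johnson')
    transformed = pd.DataFrame(pt.fit_transform(sales),
                               columns=sales.columns, index=sales.index)

    # correlate every transformed category with the target (Pearson by default)
    corr = transformed.corrwith(target)
    print(corr.abs().idxmax())  # category with the strongest correlation
    ```

    Whether transforming first is the right call depends on the goal: Pearson correlation measures linear association, so reducing skew can make it more meaningful, but rank-based alternatives such as Spearman correlation are invariant to monotone transforms and sidestep the lambda question entirely.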