python pandas probability-distribution

Why does applying probability distributions and transformations result in the same value?


I'm applying multiple Beta, Gamma and HalfNorm transforms to each column of my pandas DataFrame. The DataFrame contains marketing spend: each row is one week of spend and each column is a type of spend.

The Python functions and the code that applies the transforms are as follows:

import numpy as np
import scipy.stats as st


def geometric_adstock_tt(
    x, alpha=0, L=12, normalize=True
):  # L=12 (weeks, since each row is one week) is the delay or lag we expect to see?
    """
    The term "geometric" refers to the way weights are assigned to past values,
    which follows a geometric progression.
    In a geometric progression,
    each term is found by multiplying the previous term by a fixed, constant ratio (commonly denoted as "r").
    In the case of the geometric adstock function, the "alpha" parameter serves as this constant ratio.
    """
    # vector of weights that decay geometrically at rate alpha over L lags (12 weeks by default)
    w = np.array([alpha**i for i in range(L)])
    xx = np.stack(
        [np.concatenate([np.zeros(i), x[: x.shape[0] - i]]) for i in range(L)]
    )

    if not normalize:
        y = np.dot(w, xx)
    else:
        y = np.dot(
            w / np.sum(w), xx
        )  # dot product to get marketing channel over time frame of decay
    return y


### non-linear saturation function
def logistic_function(x_t, mu=0.1):
    # apply the logistic saturation (1 - e^(-mu*x)) / (1 + e^(-mu*x)) to the spend variable
    return (1 - np.exp(-mu * x_t)) / (1 + np.exp(-mu * x_t))

#################
response_mean = []
# Create Distributions
halfnorm_dist = st.halfnorm(loc=0, scale=5)
# Create a beta distribution
beta_dist = st.beta(a=3, b=3)
# Create a gamma distribution
gamma_dist = st.gamma(a=3)

delay_channels = [
    'TV', 'Referral', 'DirectMail', 'TradeShows', 'SocialMedia','DisplayAds_Standard', 'ContentMarketing',
       'GoogleAds', 'SEO', 'Email', 'AffiliateMarketing',
]
non_lin_channels = ["DisplayAds_Programmatic"]
################ ADSTOCK CHANNELS
for channel_name in delay_channels:
    xx = df_in[channel_name].values
    print(f"Adding Delayed Channels: {channel_name}")

    # apply beta transform
    y = beta_dist.pdf(xx)

    # apply geometric adstock transform
    geo_transform = geometric_adstock_tt(y)

    # apply gamma transform
    z = gamma_dist.pdf(geo_transform)

    # apply logistic function transform
    log_transform = logistic_function(z)

    # apply halfnorm transform
    output = halfnorm_dist.pdf(log_transform)
    
    # append output
    response_mean.append(list(output))
    
################# SATURATION ONLY
for channel_name in non_lin_channels:
    xx = df_in[channel_name].values
    
    # apply gamma transform
    z = gamma_dist.pdf(xx)

    # apply logistic function transform
    log_transform = logistic_function(z)

    # apply halfnorm transform
    output = halfnorm_dist.pdf(log_transform)
    
    # append output
    response_mean.append(list(output))

[image: output of the transformations]

I'm not quite understanding why all values are being transformed to the same value. I would be so appreciative of any insight! Thanks so much:)


Solution

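  • I believe what's happening is that the beta distribution you defined is only supported on the range 0 ≤ x ≤ 1 (see the notes in the beta distribution documentation), so any input outside this range has a pdf value of 0. If your spend values are all larger than 1, every beta pdf value is 0, those zeros pass unchanged through the following steps, and the halfnorm pdf evaluated at 0 is the same constant for every row (about 0.16 with scale=5).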

    So one possibility is to first min-max scale all of your columns to be in the range 0-1 using the following:

    df_in = (df_in-df_in.min())/(df_in.max()-df_in.min())
    

    Using some made-up data:

    import numpy as np
    import pandas as pd

    delay_channels = [
        'TV', 'Referral', 'DirectMail', 'TradeShows', 'SocialMedia','DisplayAds_Standard', 'ContentMarketing',
           'GoogleAds', 'SEO', 'Email', 'AffiliateMarketing',
    ]
    non_lin_channels = ["DisplayAds_Programmatic"]
    
    np.random.seed(42)  # seed before drawing so the sample data is reproducible
    sample_dates = pd.date_range('2023-01-01','2024-01-01',freq='7D')
    sample_data_dict = {
        channel: 1000 + 100*np.random.rand(53) for channel in delay_channels+non_lin_channels
    }
    sample_data_dict['Date'] = sample_dates
    df_in = pd.DataFrame(sample_data_dict)
    df_in = df_in.set_index('Date')
    df_in = (df_in-df_in.min())/(df_in.max()-df_in.min())
    

    After applying your transformations, I get the following:

    [image: output of the transformations on the scaled data]
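
The explanation above can be checked on a handful of numbers. The sketch below is only an illustration: `raw_spend` is made-up data, not the asker's, and it keeps only the beta and halfnorm steps, on the assumption that an all-zero input stays all-zero through the adstock, gamma (a=3) and logistic steps in between.

```python
import numpy as np
import scipy.stats as st

# Same distributions as in the question.
beta_dist = st.beta(a=3, b=3)
halfnorm_dist = st.halfnorm(loc=0, scale=5)

# Hypothetical weekly spend values, all far above 1 (assumption for illustration).
raw_spend = np.array([1010.0, 1042.5, 1077.3, 1093.8])

# Outside the support [0, 1] the beta pdf is 0 for every input.
beta_raw = beta_dist.pdf(raw_spend)

# The halfnorm pdf is then evaluated at 0 for every row, giving one constant.
constant_output = halfnorm_dist.pdf(beta_raw)

# Min-max scaling moves the values into [0, 1], where the beta pdf varies.
scaled_spend = (raw_spend - raw_spend.min()) / (raw_spend.max() - raw_spend.min())
beta_scaled = beta_dist.pdf(scaled_spend)
varying_output = halfnorm_dist.pdf(beta_scaled)

print("beta pdf on raw spend:   ", beta_raw)
print("halfnorm on raw path:    ", constant_output)
print("beta pdf on scaled spend:", beta_scaled)
print("halfnorm on scaled path: ", varying_output)
```

Note that after min-max scaling the smallest and largest values in each column land exactly on 0 and 1, where this beta pdf is also 0, so those two rows still map to the same halfnorm value; only the interior values differ.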