Search code examples
pythonab-testingt-testhypothesis-testexperimental-design

How to compute the confidence interval of the Difference in Differences method using Python?


I'm trying to analyze the total active minutes per user before and after an experiment. Here I've included the associated user data before and after the experiment - variant_number = 0 indicates control group while 1 means treatment group. Specifically, I'm interested in the mean (average total active minutes per user).

First, I calculated the before-after difference in treatment outcome and the before-after difference in control outcome (-183.7 and 19.4 respectively). The difference in differences = 203.1 in this case.

I'm wondering how I can use Python to construct a 95% confidence interval of the difference in differences? (I can provide more code/context if needed)


Solution

  • You can use a linear model and measure the interaction effect (group[T.1]:period[T.pre] below). The average difference in differences for these simulated data is -223.1779, the p-value for the interaction is p < 5e-4 so highly significant and the 95% confidence interval is [-276.360, -169.995].

    import statsmodels.api as sm
    import statsmodels.formula.api as smf
    import pandas as pd
    import numpy as np
    np.random.seed(14)
    
    minutes_0_pre = np.random.normal(loc=478, scale=1821, size=39776)
    minutes_1_pre = np.random.normal(loc=275, scale=1078, size=9921)
    
    minutes_0_post = np.random.normal(loc=458, scale=1653, size=37425)
    minutes_1_post = np.random.normal(loc=458, scale=1681, size=9208)
    
    df = pd.DataFrame({'minutes': np.concatenate((minutes_0_pre, minutes_1_pre, minutes_0_post, minutes_1_post)),
                       'group': np.concatenate((np.repeat(a='0', repeats=minutes_0_pre.size),
                                                np.repeat(a='1', repeats=minutes_1_pre.size),
                                                np.repeat(a='0', repeats=minutes_0_post.size),
                                                np.repeat(a='1', repeats=minutes_1_post.size))),
                       'period': np.concatenate((np.repeat(a='pre', repeats=minutes_0_pre.size + minutes_1_pre.size),
                                                np.repeat(a='post', repeats=minutes_0_post.size + minutes_1_post.size)))
                       })
    model = smf.glm('minutes ~ group * period', df, family=sm.families.Gaussian()).fit()
    print(model.summary())
    

    Output:

    Generalized Linear Model Regression Results                  
    ==============================================================================
    Dep. Variable:                minutes   No. Observations:                96330
    Model:                            GLM   Df Residuals:                    96326
    Model Family:                Gaussian   Df Model:                            3
    Link Function:               identity   Scale:                      2.8182e+06
    Method:                          IRLS   Log-Likelihood:            -8.5201e+05
    Date:                Mon, 18 Jan 2021   Deviance:                   2.7147e+11
    Time:                        23:05:53   Pearson chi2:                 2.71e+11
    No. Iterations:                     3                                         
    Covariance Type:            nonrobust                                         
    ============================================================================================
                                   coef    std err          z      P>|z|      [0.025      0.975]
    --------------------------------------------------------------------------------------------
    Intercept                  456.2792      8.678     52.581      0.000     439.271     473.287
    group[T.1]                  14.9314     19.529      0.765      0.445     -23.344      53.207
    period[T.pre]               21.7417     12.089      1.798      0.072      -1.953      45.437
    group[T.1]:period[T.pre]  -223.1779     27.134     -8.225      0.000    -276.360    -169.995
    ============================================================================================
    

    EDIT:

    Since your summary statistics show that your distribution is heavily skewed, bootstrapping is actually a more reliable method to estimate confidence intervals:

    r = 1000
    bootstrap = np.zeros(r)
    
    for i in range(0, r):
        sample_index = np.random.choice(a=range(0, df.shape[0]), size=df.shape[0], replace=True)
        df_sample = df.iloc[sample_index]
        model = smf.glm('minutes ~ group * period', df_sample, family=sm.families.Gaussian()).fit()
        bootstrap[i] = model.params.iloc[3]  # interaction
    
    bootstrap = pd.DataFrame(bootstrap, columns=['interaction'])
    print(bootstrap.quantile([0.025, 0.975]).T)
    

    Output:

                      0.025       0.975
    interaction -273.524899 -175.373177