
Trying to understand differences in weighted logistic regression outputs between Statsmodels and R's survey & srvyr packages


I have a fictional weighted survey dataset that contains information about respondents' car colors and their response to the question "I enjoy driving fast." I would like to perform a regression to see whether the likelihood of slightly agreeing with this question varies based on whether or not the respondent drives a black car. (This isn't a serious analysis; I'm just introducing it for the purpose of comparing weighted regression outputs in R and Python.)

In order to answer this question, I first ran a weighted logistic regression using R's survey and srvyr packages. This regression provided a test statistic of -1.18 for the black car color coefficient and a p value of 0.238. However, when I ran a weighted logistic regression within statsmodels, I received a test statistic of -1.35 and a p value of 0.177 for this coefficient. I would like to understand why these test statistics are different, and whether I'm making any mistakes within my setup for either test that could explain this discrepancy.

It's also worth noting that, when I removed the weight component from each test, my test statistics and p values were almost identical. It therefore seems that these two implementations are treating my survey weights differently.
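
To illustrate what I mean by "treating the weights differently": even for something as simple as a weighted proportion, treating weights as frequency counts versus sampling weights yields the same point estimate but different standard errors. Here is a toy sketch with made-up numbers (unrelated to the survey data above); the design-based variance shown is the standard with-replacement sandwich-style estimator, used here purely for illustration:

```python
import math

# Toy data (made up for illustration): y is a 0/1 response, w a survey weight.
y = [1, 0, 0, 1, 0, 1, 0, 0]
w = [2.0, 0.5, 1.5, 1.0, 2.0, 0.5, 1.0, 1.5]

n = len(y)
wsum = sum(w)
p = sum(wi * yi for wi, yi in zip(w, y)) / wsum  # weighted proportion

# Frequency-weight convention: each row counts as w copies,
# so the implied sample size is sum(w).
se_freq = math.sqrt(p * (1 - p) / wsum)

# Design-based (sampling-weight) convention: a with-replacement
# sandwich-style variance that keeps the actual sample size n.
resid = [wi * (yi - p) for wi, yi in zip(w, y)]
se_design = math.sqrt(n / (n - 1) * sum(r * r for r in resid) / wsum ** 2)

# Same point estimate under both conventions, different standard errors:
# p = 0.35, se_freq ≈ 0.151, se_design ≈ 0.197
```
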

Here is my code. (Note: I am running my R tests within rpy2 so that I can execute them in the same notebook as my Python code. You should be able to reproduce this output in a Jupyter Notebook as long as you have rpy2 set up.)

import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm
%load_ext rpy2.ipython
%R library(dplyr)
%R library(srvyr)
%R library(survey)
%R library(broom)

df_car_survey = pd.read_csv(
    'https://raw.githubusercontent.com/ifstudies/\
carsurveydata/refs/heads/main/car_survey.csv')

# Adding dummy columns for independent and dependent variables:

for column in ['Car_Color', 'Enjoy_Driving_Fast']:
    df_car_survey = pd.concat([df_car_survey, pd.get_dummies(
        df_car_survey[column], dtype = 'int', prefix = column)],
             axis = 1)

df_car_survey.columns = [column.replace(' ', '_') for column in 
                         df_car_survey.columns]


# Loading DataFrame into R and creating a survey design object:
# See https://tidy-survey-r.github.io/tidy-survey-book/c10-sample-designs-replicate-weights.html
# for more details.
# This book was also invaluable.
%Rpush df_car_survey
%R df_sdo <- df_car_survey %>% as_survey_design(\
    weights = 'Weight')

print("Survey design object:")
%R print(df_sdo)
    
# Logistic regression in R:
# (This code was based on that found in 
# https://tidy-survey-r.github.io/tidy-survey-book/c07-modeling.html )
%R logit_result <- svyglm(\
    Enjoy_Driving_Fast_Slightly_Agree ~ Car_Color_Black, \
design = df_sdo, na.action = na.omit,\
family = quasibinomial())
print("\n\n Logistic regression results from survey package within R:")
%R print(tidy(logit_result))

# Logistic regression within Python:
# (Based on StupidWolf's response at
# https://stackoverflow.com/a/62798889/13097194 )
glm = smf.glm("Enjoy_Driving_Fast_Slightly_Agree ~ Car_Color_Black",
        data = df_car_survey, 
        family = sm.families.Binomial(),
        freq_weights = df_car_survey['Weight'])
model_result = glm.fit()
print("\n\nStatsmodels logistic regression results:")
print(model_result.summary())

Output:

Survey design object:
Independent Sampling design (with replacement)
Called via srvyr
Sampling variables:
  - ids: `1` 
  - weights: Weight 
Data variables: 
  - Car_Color (chr), Weight (dbl), Enjoy_Driving_Fast (chr), Count (int),
    Response_Sort_Map (int), Car_Color_Black (int), Car_Color_Red (int),
    Car_Color_White (int), Enjoy_Driving_Fast_Agree (int),
    Enjoy_Driving_Fast_Disagree (int), Enjoy_Driving_Fast_Slightly_Agree (int),
    Enjoy_Driving_Fast_Slightly_Disagree (int),
    Enjoy_Driving_Fast_Strongly_Agree (int),
    Enjoy_Driving_Fast_Strongly_Disagree (int)


 Logistic regression results from survey package within R:
# A tibble: 2 × 5
  term            estimate std.error statistic  p.value
  <chr>              <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)       -2.08      0.145    -14.3  8.85e-43
2 Car_Color_Black   -0.293     0.248     -1.18 2.38e- 1


Statsmodels logistic regression results:
                         Generalized Linear Model Regression Results                         
=============================================================================================
Dep. Variable:     Enjoy_Driving_Fast_Slightly_Agree   No. Observations:                 1059
Model:                                           GLM   Df Residuals:                     1057
Model Family:                               Binomial   Df Model:                            1
Link Function:                                 Logit   Scale:                          1.0000
Method:                                         IRLS   Log-Likelihood:                -345.24
Date:                               Tue, 04 Feb 2025   Deviance:                       690.48
Time:                                       10:07:53   Pearson chi2:                 1.06e+03
No. Iterations:                                    5   Pseudo R-squ. (CS):           0.001763
Covariance Type:                           nonrobust                                         
===================================================================================
                      coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept          -2.0834      0.125    -16.693      0.000      -2.328      -1.839
Car_Color_Black    -0.2931      0.217     -1.350      0.177      -0.719       0.133
===================================================================================

Solution

  • As suggested in the comments by George Savva, DaveArmstrong, and BenBolker, I believe the issue here is that my R code is interpreting the weights as sampling weights (which makes sense, since I'm applying survey analysis libraries), whereas Statsmodels is interpreting them as frequency weights.

    For further evidence of this, I compared the output of a chi squared test run through R's survey/srvyr libraries with that of an equivalent test in Python's Samplics library, which is built for survey analyses. Here is the code that I used (which builds off that shown above):

    %R svychisq_result <- svychisq(\
        ~ Enjoy_Driving_Fast_Slightly_Agree + Car_Color_Black, \
    design = df_sdo, na.action = na.omit)
    print("\n\n Chi squared results from survey package within R:")
    %R print(svychisq_result)
    
    from samplics.categorical import CrossTabulation
    from samplics.utils.types import PopParam
    
    crosstab = CrossTabulation(param=PopParam.prop)
    crosstab.tabulate(
        vars=df_car_survey[["Enjoy_Driving_Fast_Slightly_Agree", 
                            "Car_Color_Black"]],
        samp_weight=df_car_survey['Weight'],
        remove_nan=True
    )
    # Based on
    # https://samplics-org.github.io/samplics/pages/
    # categorical_tabulation.html#two-way-tabulation-cross-tabulation
    print("Chi squared results from Python's Samplics library:", crosstab)
    

    The F statistic and p value (1.4033 and 0.2364, respectively) generated by the survey/srvyr libraries matched the adjusted F statistic and p value created by the Samplics library.

    Once logistic regression becomes available within Samplics, I plan to use that library to perform regression analyses of survey data within Python. In the meantime, unless there's an alternative Python approach that lets me specify sampling weights (rather than frequency weights) for my regression analyses, it looks like I'll need to rely on R for those tests.