python, logistic-regression, statsmodels, sample

Is there a way to implement sample weights?


I'm using statsmodels for logistic regression analysis in Python. For example:

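import statsmodels.api as sm
import numpy as np
# 100 evenly spaced predictor values in [0, 1)
x = np.arange(0, 1, 0.01)
# Binary outcome: y is 1 with probability x, otherwise 0
y = np.random.rand(100)
y[y<=x] = 1
y[y!=1] = 0
# Add an intercept column, then fit an unweighted logit model
x = sm.add_constant(x)
lr = sm.Logit(y,x)
result = lr.fit().summary()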

But I want to define different weightings for my observations. I'm combining 4 datasets of different sizes, and want to weight the analysis such that the observations from the largest dataset do not dominate the model.


Solution

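  • It took me a while to work this out, but it is actually quite easy to create a logit model in statsmodels with weighted rows or multiple observations per row: fit a binomial GLM whose response has two columns, the success count and the failure count for each row. Here's how it's done: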

    import statsmodels.api as sm
    # Response: (successes, failures) per row. Binomial's default link is logit.
    logmodel = sm.GLM(trainingdata[['Successes', 'Failures']], trainingdata[['const', 'A', 'B', 'C', 'D']],
                      family=sm.families.Binomial()).fit()