Search code examples
pythonrlogistic-regressionstatsmodels

Meaning of cbind()


I have this code in R:

mp <- max_power$max
names(mp) <- max_power$athlete
x$normalized_power <- as.vector(x$power / mp[x$athlete])
x$temp_dist <- abs(x$temperature - 60)
x$log_time <- log(x$time)

normalized_power <- x$normalized_power

fit <- glm(
  cbind(normalized_power, 1 - normalized_power) ~ log_time + temp_dist,
  family = binomial(),
  data = x
)

I understand everything except for the cbind() in the equation. I am trying to replicate this using python:

        self.pmax = max(self.power_array)

        normalized_power = pd.DataFrame([x / self.pmax for x in self.power_array])
        one_minus_np = pd.DataFrame([1 - x for x in [x / self.pmax for x in self.power_array]])

        log_time = [math.log(x) for x in self.time_array]
        temp_diff = [abs(x - 60) for x in self.temp_array]

        df = pd.concat([normalized_power, one_minus_np], axis=1)
        df.columns = ["np", "1-np"]
        df["log_time"] = log_time
        df["temp_diff"] = temp_diff

        model = sm.glm("np ~ log_time + temp_diff", df, family=family.Binomial()).fit()

Can someone explain what the cbind() is doing and how it can be replicated?


Solution

  • Usually when you run a logistic regression, your outcome variable (on the left of the regression formula) is a simple TRUE/FALSE (or 1/0 or success/failure) column that indicates absence or presence of the characteristic you are trying to model.

    However, we can sometimes have the situation where we have a column of counts of successes and another column of counts of failures.

    For example, suppose I get 8 men to take 10 shots at a basketball hoop and record their scores. I also measure their height because I want to know whether this predicts their accuracy.

    My data might look something like this:

    baskets <- data.frame(height = c(1.5, 1.95, 1.8, 1.76, 1.52, 1.91, 1.66, 1.68),
                          shots_on_target = c(4, 9, 6, 8, 3, 9, 5, 5))
    
    baskets
    #>   height shots_on_target
    #> 1   1.50               4
    #> 2   1.95               9
    #> 3   1.80               6
    #> 4   1.76               8
    #> 5   1.52               3
    #> 6   1.91               9
    #> 7   1.66               5
    #> 8   1.68               5
    

    If I want to run a logistic regression on these data, I need to pass a column of successes and a column of failures as the outcome variable. Fortunately, glm allows us to do just that. All we need to do is bind the columns together with cbind - this will convert the success / failures columns into a single 2-column matrix.

    Of course, I don't have a failures column, but since I know that each person had 10 shots, I can easily create it by doing 10 - shots_on_target

    Therefore, my model can be created like so:

    model <- glm(cbind(shots_on_target, 10 - shots_on_target) ~ height, 
                 data = baskets, family = binomial)
    
    summary(model)
    #> 
    #> Call:
    #> glm(formula = cbind(shots_on_target, 10 - shots_on_target) ~ 
    #>     height, family = binomial, data = baskets)
    #> 
    #> Deviance Residuals: 
    #>      Min        1Q    Median        3Q       Max  
    #> -0.94529  -0.31349   0.00671   0.52595   0.80363  
    #> 
    #> Coefficients:
    #>             Estimate Std. Error z value Pr(>|z|)    
    #> (Intercept)  -10.094      3.038  -3.323 0.000892 ***
    #> height         6.182      1.788   3.457 0.000546 ***
    #> ---
    #> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    #> 
    #> (Dispersion parameter for binomial family taken to be 1)
    #> 
    #>     Null deviance: 16.944  on 7  degrees of freedom
    #> Residual deviance:  2.564  on 6  degrees of freedom
    #> AIC: 26.534
    #> 
    #> Number of Fisher Scoring iterations: 4
    

    and we can see that height was positively predictive of the number of shots on target.

    In your example, the format for the outcome variable is similar to my example. However, it doesn't make a lot of sense for the numbers to be normalized to between 0 and 1. The two input columns should really be integers for this approach to make sense.

    Created on 2021-11-04 by the reprex package (v2.0.0)