
How to run regression for several random subsets of data


I have a large data set from which I want to draw random subsets (randomly_live) and then fit a model (logistic regression) in R. I want to run 100 logistic regressions and count how many times each coefficient had a positive sign and how many times it was significant, then show the best model by the Hosmer-Lemeshow criterion.

I think this can be done with a loop, but I am confused about how to set it up.

This is the code I have for one iteration:

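    # Draw n rows at random (without replacement) from a data frame
    randomRows = function(df,n){
      return(df[sample(nrow(df),n),])
    }


    set.seed(567)
    # Split the data by ID and keep one random row per ID
    df.split <- split(full_data, full_data$ID)


    df.sample <- lapply(df.split, randomRows, 1)
    df.final <- do.call("rbind", df.sample)
    # Draw as many rows as there are in `default`, then stack the two sets
    randomly_live <- randomRows(df.final, nrow(default))
    data1 <- rbind(default, randomly_live)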


    model = glm(default ~ log(assets)+…+H1, data = data1,
                  family = 'binomial')


    library(ResourceSelection)

    hl <- hoslem.test(model$y, fitted(model), g=10)

Can anyone please help?


Solution

  • Here is something that could work, with the built-in mtcars data standing in for your own:

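    library(ResourceSelection)  # provides hoslem.test()

    set.seed(567)
    myResults <- list()

    for(i in 1:100){
      # draw a fresh random subset each iteration (mtcars stands in for your data)
      data_i <- mtcars[sample(nrow(mtcars), 25), ]
      # family = 'binomial' makes this a logistic regression
      model <- glm(vs ~ mpg + wt, data = data_i, family = 'binomial')
      hl <- hoslem.test(model$y, fitted(model), g=10)
      pos <- length(which(coef(model)>0))        # coefficients with a positive sign
      pvals <- summary(model)$coefficients[,4]   # p-value of each coefficient
      hl_pval <- hl$p.value
      myResults[[i]] <- list(pos = pos, pvals = pvals,hl_pval=hl_pval)
    }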
    
    # best fit by Hosmer-Lemeshow: the highest p-value (least evidence of lack of fit)
    which.max(unlist(lapply(myResults, FUN = function(x) x$hl_pval)))
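
To get the per-coefficient counts the question asks for (how often each coefficient is positive and how often it is significant), one option is to keep every fitted model and tally afterwards. This is only a minimal sketch using base R. The mtcars data, the formula `vs ~ mpg + wt`, the subset size of 25 rows, and the 0.05 cutoff are all illustrative assumptions, not taken from the question; swap in your own sampling step and formula.

```r
# Assumptions (not from the question): mtcars as stand-in data,
# vs ~ mpg + wt as the formula, 25 rows per subset, alpha = 0.05.
set.seed(567)
n_iter <- 100
alpha  <- 0.05

# Fit one logistic regression per random subset and keep all the models
fits <- lapply(seq_len(n_iter), function(i) {
  subset_i <- mtcars[sample(nrow(mtcars), 25), ]
  glm(vs ~ mpg + wt, data = subset_i, family = binomial)
})

# One row per iteration, one column per coefficient
est <- t(sapply(fits, coef))
pv  <- t(sapply(fits, function(m) summary(m)$coefficients[, "Pr(>|z|)"]))

# Count, per coefficient, how often it was positive and how often significant
tally <- data.frame(
  term              = colnames(est),
  times_positive    = colSums(est > 0),
  times_significant = colSums(pv < alpha),
  row.names         = NULL
)
print(tally)
```

Keeping the models in a list also means that any single fit, for example the one with the best Hosmer-Lemeshow result, can be pulled out later with `fits[[k]]`.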