R newbie (learning on the job) running into errors when trying to modify previous programmer's regression


I am a researcher running a binomial regression (and coding and doing statistics) for the first time ever for work - it's been an experience! I took over this project for work midway through, so did not develop the initial coding myself. I've never coded before so I've been learning R as I go. However, I've run into an error issue that I cannot figure out (although I suspect it's likely pretty simple), and any help would be GREATLY appreciated. I've laid it out in more detail below and can attach screenshots if helpful.

The initial dataset was 1,276 individuals (rows), each responding to a selection from 188 questions (columns). I have since been asked to add responses to 8 further questions to this initial dataset, meaning 196 questions (columns) for the final dataset. Overall, there have only ever been 9 columns, and that remains unchanged. However, I am having an issue with adjusting my code to account for the addition of these new columns.

Any ideas welcome with respect to what might be causing the mismatch of rows!

For example, my first code, which would run:

Ans_Data = read_xlsx("DSM Data 15.2.23 IB v4.xlsx",
  sheet = "CHANGED Tab 2 - AR weighted",
  range = "A12:GG1290", col_names = F, col_types = c("text",rep("numeric",188)))
Question_Data = t(read_xlsx("DSM Data 15.2.23 IB v4.xlsx",
  sheet = "CHANGED Tab 2 - AR weighted",
  range = "A1:GG10", col_names = T))

colnames(Question_Data) = Question_Data[1,] 
Question_Data = Question_Data[-1,] 
Question_Data = data.table(Question_Data)

Ans_Data_2 = Ans_Data %>%
  pivot_longer(cols = colnames(Ans_Data)[2:189])

for (i in 1:1278) {
  if (i==1) {
    Question_Data_2 = rbind(Question_Data,Question_Data)
  } else {
    Question_Data_2 = rbind(Question_Data_2,Question_Data)
  }
}

Ans_Data_3 = cbind(Ans_Data_2, Question_Data_2)

However, my updated code:

Ans_Data = read_xlsx("DSM Data 15.2.23 DP v5.xlsx",
  sheet = "CHANGED Tab 2 - AR weighted",
  range = "A12:GO1287", col_names = F,col_types = c("text",rep("numeric",196)))
Question_Data = t(read_xlsx("DSM Data 15.2.23 DP v5.xlsx", 
  sheet = "CHANGED Tab 2 - AR weighted",
  range = "A1:GO10", col_names = T))

colnames(Question_Data) = Question_Data[1,] 
Question_Data = Question_Data[-1,] 
Question_Data = data.table(Question_Data)

Ans_Data_2 = Ans_Data %>%
  pivot_longer(cols = colnames(Ans_Data)[2:197])

for (i in 1:1278) {
  if (i==1) {
    Question_Data_2 = rbind(Question_Data,Question_Data)
  } else {
    Question_Data_2 = rbind(Question_Data_2,Question_Data)
  }
}

Ans_Data_3 = cbind(Ans_Data_2, Question_Data_2)

produces the following error:

Error in data.frame(..., check.names = FALSE) : arguments imply differing number of rows: 250096, 250684

Solution

  • Sorry for posting this as an answer (I can't comment yet). I came across your code by chance and it caught my attention. Anyway, your error is pretty simple: you are trying to "column bind" (cbind) two data frames with different numbers of rows. Where that difference comes from is another question.

    Reading your code, you import two datasets:

    Ans_Data = read_xlsx("DSM Data 15.2.23 DP v5.xlsx", sheet = "CHANGED Tab 2 - AR weighted", range = "A12:GO1287", col_names = F,col_types = c("text",rep("numeric",196)))
    

    and

    Question_Data = t(read_xlsx("DSM Data 15.2.23 DP v5.xlsx", sheet = "CHANGED Tab 2 - AR weighted", range = "A1:GO10", col_names = T))
    

    From the naming of the dataset I assume that Ans_Data holds the responses. It is a dataset of 197 columns (A to GO) and 1276 rows (12 to 1287). You later pivot that data frame into long format, which in your case creates a data frame with 250096 rows: the 196 pivoted columns (2:197) times the 1276 rows.

    The second dataset (Question_Data) is read from a 10-row range (A1:GO10) and then transposed (the t), which gives 197 rows, one per spreadsheet column from A to GO. You then use the first row as the column names and drop it, leaving 196 rows. Later you run a loop that, for i = 1, row-binds Question_Data to itself, giving Question_Data_2 with 392 rows. For i > 1 it appends another 196 rows each time, and that happens 1277 times. The resulting data frame Question_Data_2 therefore has 392 + 196 * 1277 = 250684 rows.

    Your datasets have 250096 and 250684 rows, so, as mentioned, cbind gives an error. Assuming Question_Data gives the design matrix and Ans_Data the responses, the code was probably built to merge the design matrix onto the responses. Given that you want 196 responses from 1276 individuals, this should be 250096 rows (196 times 1276). So I would suggest that the sequence you loop over is too long and should be 1:1275. It is 1275 rather than 1276 because the first iteration, in the if clause, already adds two copies.
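
If you would rather not hard-code the loop bound at all, here is a minimal sketch of the same stacking done with rep(), so the number of copies follows the number of respondents. It uses small made-up data frames, not your spreadsheet: n_people, n_questions, the id/name/value columns and the question/topic columns are all assumptions for illustration. It also assumes the long-format rows are ordered person by person with the questions cycling within each person, which is the order pivot_longer() produces.

```r
# Row counts from the question, checked with plain arithmetic.
stopifnot(196 * 1276 == 250096)        # rows in Ans_Data_2 (196 questions x 1276 people)
stopifnot(196 * (2 + 1277) == 250684)  # rows built by the loop over 1:1278

# Toy stand-ins (hypothetical sizes) for the real objects.
n_people    <- 4   # in the real data this would be nrow(Ans_Data), i.e. 1276
n_questions <- 3   # in the real data: 196 question columns

# One row per question, like Question_Data after dropping its first row.
Question_Data <- data.frame(
  question = paste0("Q", seq_len(n_questions)),
  topic    = c("a", "b", "c")
)

# Long-format answers: for each person, one row per question.
Ans_Data_2 <- data.frame(
  id    = rep(paste0("P", seq_len(n_people)), each = n_questions),
  name  = rep(Question_Data$question, times = n_people),
  value = seq_len(n_people * n_questions)
)

# Stack Question_Data once per person instead of looping with a fixed bound.
idx <- rep(seq_len(nrow(Question_Data)), times = n_people)
Question_Data_2 <- Question_Data[idx, , drop = FALSE]
rownames(Question_Data_2) <- NULL

# Fail early, with a clear location, if the two tables do not line up.
stopifnot(nrow(Question_Data_2) == nrow(Ans_Data_2))
stopifnot(all(Question_Data_2$question == Ans_Data_2$name))

Ans_Data_3 <- cbind(Ans_Data_2, Question_Data_2)
```

With your data the only change should be taking the number of copies from nrow(Ans_Data) rather than typing 1275 or 1278, so adding or removing respondents later does not break the cbind again. The two stopifnot() checks before the cbind are optional, but they turn a silent misalignment into an immediate error.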