
Using apply() to select specific variables by name


Ok, basically I have a dataset of households that looks like this:



household_data <- data.frame(
  id                 = 1:4,
  gender_component_1 = c(1, 2, 2, 2),
  gender_component_2 = c(2, 1, 1, 2),
  bread_winner       = c(1, 1, 2, 1)
)


I want to construct a variable ('gender_bread_winner') that reports the sex of the breadwinner in the family. The breadwinner is either component 1 or component 2, as recorded numerically in the separate variable bread_winner.

I've come up with the following loop:

library(dplyr)  # select() is assumed to be dplyr's

var_max <- paste("gender_component", household_data$bread_winner, sep = "_")

for (i in 1:nrow(household_data)) {
  household_data$gender_bread_winner[i] <- select(household_data[i,], var_max[i])
}

Unfortunately, the real dataset is huge and this solution is far from optimal. Is it possible to do the same thing using apply or something similar? I have not managed to so far.

Thanks in advance

EDIT: Thank you all for your answers! In the end I found it easier to use a series of ifelse calls, like this:


dataset$sesso_max <- NA
dataset$sesso_max <- ifelse(dataset$max_percettore == 1, dataset$sesso_1, dataset$sesso_max)
dataset$sesso_max <- ifelse(dataset$max_percettore == 2, dataset$sesso_2, dataset$sesso_max)
dataset$sesso_max <- ifelse(dataset$max_percettore == 3, dataset$sesso_3, dataset$sesso_max)
dataset$sesso_max <- ifelse(dataset$max_percettore == 4, dataset$sesso_4, dataset$sesso_max)
dataset$sesso_max <- ifelse(dataset$max_percettore == 5, dataset$sesso_5, dataset$sesso_max)
dataset$sesso_max <- ifelse(dataset$max_percettore == 6, dataset$sesso_6, dataset$sesso_max)


Solution

  • If there are only 2 gender_component columns, a simple ifelse will do.

    household_data <- transform(household_data, gender_bread_winner  = 
            ifelse(bread_winner == 1, gender_component_1, gender_component_2))
    

    This says: where bread_winner has value 1, take the value from gender_component_1; otherwise take it from the gender_component_2 column.
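
    As a small illustration of how ifelse works element by element, here is a sketch on plain vectors. The vectors are hypothetical and only mirror the columns of household_data; the NA entry is added to show that a missing bread_winner gives a missing result.

    ```r
    # Hypothetical vectors mirroring the columns of household_data.
    bread_winner       <- c(1, 2, NA)
    gender_component_1 <- c(1, 2, 2)
    gender_component_2 <- c(2, 1, 1)

    # ifelse() evaluates the test element by element:
    # TRUE takes from the second argument, FALSE from the third, NA stays NA.
    result <- ifelse(bread_winner == 1, gender_component_1, gender_component_2)
    result
    # [1]  1  1 NA
    ```

    Because the whole column is processed in one call, no explicit loop over rows is needed.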


    For more than 2 columns we can use matrix indexing, as follows:

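    gender_cols <- grep('gender_component', names(household_data), value = TRUE)
    # Index with a (row, column) matrix; the `[` must stay on the same line as
    # the data frame, otherwise R ends the statement before the subscript.
    household_data$gender_bread_winner <- household_data[gender_cols][
      cbind(1:nrow(household_data), household_data$bread_winner)]
    household_data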
    
    #  id gender_component_1 gender_component_2 bread_winner gender_bread_winner
    #1  1                  1                  2            1                   1
    #2  2                  2                  1            1                   2
    #3  3                  2                  1            2                   1
    #4  4                  2                  2            1                   2
    

    Explanation for the answer -

    gender_cols has all the columns that have "gender_component" in them.

    gender_cols
    #[1] "gender_component_1" "gender_component_2"
    

    We then create a two-column matrix of row and column indices, which is used to subset the gender_cols columns of the dataframe household_data.

    cbind(1:nrow(household_data), household_data$bread_winner)
    #     [,1] [,2]
    #[1,]    1    1
    #[2,]    2    1
    #[3,]    3    2
    #[4,]    4    1
    

    In other words: take the value in column 1 of row 1, column 1 of row 2, column 2 of row 3, and so on. R uses this matrix to pick exactly one cell per row from the selected columns.
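
    The matrix-indexing approach relies on the gender_component columns appearing in the data frame in the same order as their numeric suffix. A more defensive sketch could look the column up by name instead. The data frame households, its third component column, and the NA breadwinner below are hypothetical and only serve the illustration.

    ```r
    # Hypothetical data with three components and one missing breadwinner.
    households <- data.frame(
      id                 = 1:3,
      gender_component_1 = c(1, 2, 2),
      gender_component_2 = c(2, 1, 1),
      gender_component_3 = c(1, 1, 2),
      bread_winner       = c(3, 1, NA)
    )

    # Build the target column name for each row and locate it by name,
    # so the result does not depend on the order of the columns.
    gender_cols <- grep("^gender_component_[0-9]+$", names(households), value = TRUE)
    col_idx <- match(paste0("gender_component_", households$bread_winner), gender_cols)

    # Matrix indexing: one (row, column) pair per row; an NA index yields NA.
    values <- as.matrix(households[gender_cols])
    households$gender_bread_winner <- values[cbind(seq_len(nrow(households)), col_idx)]
    households$gender_bread_winner
    # [1]  1  2 NA
    ```

    The same pattern should carry over to a set of columns such as sesso_1 to sesso_6 indexed by max_percettore, replacing the chain of ifelse calls with a single lookup.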