Search code examples
rdataframesapply

Extract certain columns from data frame R


My data frame looks like this:

   x s1   s2 s3  s4
1 x1  1 1954  1 yes
2 x2  2 1955  1  no
3 x3  1 1976  2 yes
4 x4  2 1954  2 yes
5 x5  3 1943  1  no

Sample data:

df <- data.frame(x=c('x1','x2','x3','x4','x5'),
                    s1=c(1,2,1,2,3),
                    s2=c(1954,1955,1976,1954,1943), 
                    s3=c(1,1,2,2,1),
                    s4=c('yes','no','yes','yes','no'))```

Is it possible to extract the data frame's columns containing integers 1 to 3? For example, the new data frame would look like:

 newdf
   x s1 s3
1 x1  1  1
2 x2  2  1
3 x3  1  2
4 x4  2  2
5 x5  3  1

Is it possible to change the s1 and s3 columns to 0 or 1 depending on whether or not the value in the column is 1? The altered data frame would then look like:

newdf2   
 x s1 s3
1 x1  1  1
2 x2  0  1
3 x3  1  0
4 x4  0  0
5 x5  0  1

Solution

  • base R

    newdf <- df[, unique(c("x", names(which(sapply(df, function(z) is.numeric(z) & any(c(1, 3) %in% z)))))), drop = FALSE]
    newdf
    #    x s1 s3
    # 1 x1  1  1
    # 2 x2  2  1
    # 3 x3  1  2
    # 4 x4  2  2
    # 5 x5  3  1
    
    newdf[-1] <- lapply(newdf[-1], function(z) +(z == 1))
    newdf
    #    x s1 s3
    # 1 x1  1  1
    # 2 x2  0  1
    # 3 x3  1  0
    # 4 x4  0  0
    # 5 x5  0  1
    

    Walk-through:

    • first, we determine which columns are numbers and contain the numbers 1 or 3:

      sapply(df, function(z) is.numeric(z) & any(c(1, 3) %in% z))
      #     x    s1    s2    s3    s4 
      # FALSE  TRUE FALSE  TRUE FALSE 
      

      This will exclude any column that is not numeric, meaning that a character column that contains a literal "1" or "3" will not be retained. This is complete inference on my end; if you want to accept the string versions then remove the is.numeric(z) component.

    • second, we extract the names of those that are true, and prepend "x"

      c("x", names(which(sapply(df, function(z) is.numeric(z) & any(c(1, 3) %in% z)))))
      # [1] "x"  "s1" "s3"
      
    • wrap that in unique(.) if, for some reason, "x" is also numeric and contains 1 or 3 (this step is purely defensive, you may not strictly need it)

    • select those columns, defensively adding drop=FALSE so that if only one column is matched, it still returns a full data.frame

    • replace just those columns (excluding the first column which is "x") with 0 or 1; the z == 1 returns logical, and the wrapping +(..) converts logical to 0 (false) or 1 (true).

    dplyr

    library(dplyr)
    df %>%
      select(x, where(~ is.numeric(.) & any(c(1, 3) %in% .))) %>%
      mutate(across(-x, ~ +(. == 1)))
    #    x s1 s3
    # 1 x1  1  1
    # 2 x2  0  1
    # 3 x3  1  0
    # 4 x4  0  0
    # 5 x5  0  1