Search code examples
rselectdplyrnamesstartswith

How to use tidyr pivot_longer when cols = any column that starts with a certain prefix


Each week, I get a raw dataset from which I need to produce a report. I would like to write one R script that will work every week. Unfortunately, the raw data has slightly different sets of columns each week depending on what specimens were collected. Here's an example of what it looks like week to week.

library(readr)
w1 <- read_csv("species, males, females - fed, females - unfed
a,2,0,3
b,5,7,2
c,8,4,9")
w2 <- read_csv("species, males, females - mixed
a,2,0
b,5,7
c,8,4")

> w1
# A tibble: 3 x 4
  species males `females - fed` `females - unfed`
  <chr>   <dbl>           <dbl>             <dbl>
1 a           2               0                 3
2 b           5               7                 2
3 c           8               4                 9
> w2
# A tibble: 3 x 3
  species males `females - mixed`
  <chr>   <dbl>             <dbl>
1 a           2                 0
2 b           5                 7
3 c           8                 4

Here's how I normally use pivot_longer:

library(tidyr)
w1 %>% pivot_longer(cols = c(males, `females - fed`, `females - unfed`),
             names_to = c("sex","feeding_status"),
             names_sep = " - ",
             values_to = "quantity")

# A tibble: 9 x 4
  species sex     feeding_status quantity
  <chr>   <chr>   <chr>             <dbl>
1 a       males   NA                    2
2 a       females fed                   0
3 a       females unfed                 3
4 b       males   NA                    5
5 b       females fed                   7
6 b       females unfed                 2
7 c       males   NA                    8
8 c       females fed                   4
9 c       females unfed                 9

How can I write code for pivot_longer that will work on w1, w2 and w3? (edited to include w3 -- see below)

I have tried (select(., starts_with("females")) but can't figure out the right syntax. The pivot_longer documentation mentions names_pattern() and extract(), which seem promising but I don't know how to use them. Thank you!

EDIT: in response to akrun's answer, I realized I have to provide slightly more complex sample data. The code also needs to work with a column that occasionally appears in the datasets called "unknown sex", like so:

w3 <- read_csv("species, males, females - mixed, unknown sex
a,2,0,4
b,5,7,0
c,8,4,23")

> w3
# A tibble: 3 x 4
  species males `females - mixed` `unknown sex`
  <chr>   <dbl>             <dbl>         <dbl>
1 a           2                 0             4
2 b           5                 7             0
3 c           8                 4            23

The code suggested below by akrun to address w1 and w2 causes doubled values in the "unknown sex" column of w3:

w3 %>%
  pivot_longer(cols = c(males, starts_with('females')),
               names_to = c("sex", "feeding_status"), names_sep=" - ")

# A tibble: 6 x 5
  species `unknown sex` sex     feeding_status value
  <chr>           <dbl> <chr>   <chr>          <dbl>
1 a                   4 males   NA                 2
2 a                   4 females mixed              0
3 b                   0 males   NA                 5
4 b                   0 females mixed              7
5 c                  23 males   NA                 8
6 c                  23 females mixed              4

Solution

  • The columns that are common are 'species', so we can use -

    library(dplyr)
    library(tidyr)
     w1 %>% 
        pivot_longer(cols = -species, 
           names_to = c("sex","feeding_status"),
              names_sep = " - ",
              values_to = "quantity")
    

    Or one of the select_helpers

    w1 %>%
        pivot_longer(cols = c(males, starts_with('females')),
          names_to = c("sex", "feeding_status"), names_sep=" - ")
    # A tibble: 9 x 4
    #  species sex     feeding_status value
    #  <chr>   <chr>   <chr>          <dbl>
    #1 a       males   <NA>               2
    #2 a       females fed                0
    #3 a       females unfed              3
    #4 b       males   <NA>               5
    #5 b       females fed                7
    #6 b       females unfed              2
    #7 c       males   <NA>               8
    #8 c       females fed                4
    #9 c       females unfed              9
    

    If we want to include multiple cases, then matches in another option

    w3 %>% 
       pivot_longer(cols = c(males, matches('^(females|unknown)')), 
       names_to = c("sex", "feeding_status"), names_sep=" - ")
    # A tibble: 9 x 4
    #  species sex         feeding_status value
    #  <chr>   <chr>       <chr>          <dbl>
    #1 a       males       <NA>               2
    #2 a       females     mixed              0
    #3 a       unknown sex <NA>               4
    #4 b       males       <NA>               5
    #5 b       females     mixed              7
    #6 b       unknown sex <NA>               0
    #7 c       males       <NA>               8
    #8 c       females     mixed              4
    #9 c       unknown sex <NA>              23