Search code examples
tidyrdata-wrangling

Advanced pivot_longer transformation sequentially on a group of columns


I'm a little perplex concerning the exact way to proceed with this wrangling procedure.

I've a dataset which consist in raters that are assessing lung sounds (S1,...,S40). For each sound the assessed some quality of the recording (_A, _B, _X).

The dataframe looks like this bd example (I gave an example with 3 raters, scoring 2 sounds for 2 different characteristics A and B). You can see that each row represents 1 rater results for the different sounds (S_1, S_2) and sounds characteristics (_A, _B).

bd <- data.frame(Operator=c("Rater1", "Rater2", "Rater3"),
                    S1_A=c("Good", "Good", "Bad"), #quality assessment
                    S1_B=c(3,2,3), #confidence in assessment 1-5 likert scale          
                    S2_A=c("Don't know", "Bad", "Don't know"),
                    S2_B=c(4,4,2))

I'd like to get a result like that:

Operator Sound A B
Rater1 1 Good 3
Rater1 2 Don't know 4
Rater2 1 Good 2
Rater2 2 Bad 4
Rater3 1 Bad 3
Rater3 2 Don't know 2

I'm not able to design adequately the code to first create a column identifying the sound and then adding the sounds characteristics.

Thanks in advance for your help!!!


Solution

  • You can use names_pattern = with a regex to achieve your desired outcome. In this example:

    • names_to = c("Sound", ".value") creates a column to store the original column names (Sound), and .value is used to create a column for each suffix e.g. A and B.
    • names_pattern = "S(\\d+)_(A|B)" defines what values are carried over from the original column names to the new columns and column names. The brackets () in the regex are used to capture what values to carry over. The first capture corresponds to the Sound column, and captures a one or more digits using \\d+ but only if it is after a literal "S". For the .value columns, the second capture looks for either A or B after a literal underscore.

    Note that this works for your sample data, but you mention _A, _B, and _X as suffixes. To include _X (or any other suffixes), simply add them to the second capture e.g. names_pattern = "S(\\d+)_(A|B|X)".

    library(tidyr)
      
    bd |>
      pivot_longer(cols = -Operator,
                   names_to = c("Sound", ".value"),
                   names_pattern = "S(\\d+)_(A|B)")
    
    # # A tibble: 6 × 4
    #   Operator Sound   A              B
    #   <chr>    <chr>   <chr>      <dbl>
    # 1 Rater1   1       Good           3
    # 2 Rater1   2       Don't know     4
    # 3 Rater2   1       Good           2
    # 4 Rater2   2       Bad            4
    # 5 Rater3   1       Bad            3
    # 6 Rater3   2       Don't know     2