Search code examples
rdplyrtidyr

Pivot_longer from multiple columns pairwise


There is a dataset like this

df <- data.frame(
  Pseudonym = c("aa", "bb"),
  KE_date_1 = c("2022-04-01", "2022-04-03"),
  KE_content_2 = c("high pot", "high pot"),
  KE_date_3 = c("2022-08-01", "2022-08-04"),
  KE_content_4 = c("high pot return", "high pot return")
)
Pseudonym | KE_date_1 |   KE_content_2| KE_date_3  |  KE_content_4
--------------------------------------------------------------------
aa        | 2022-04-01|     high pot  | 2022-08-01 | high pot return
bb        | 2022-04-03|     high pot  | 2022-08-04 | high pot return

A column with an identifier (distinct pseudonym) and pairwise data columns, date + text. The data columns have an increasinig suffix. ..1 and ..2 belong togethter, etc.

From this a transformed long dataset is needed, where the pseudonym is not longer distinct.

df2 <- data.frame(
  Pseudonym = c("aa", "aa", "bb", "bb"),
  KE_date = c("2022-04-01", "2022-08-01", "2022-04-03", "2022-08-04"),
  KE_content = c("high pot", "high pot return", "high pot", "high pot return")
)

Pseudonym |  KE_date     |  KE_content
-------------------------------------------
aa        |  2022-04-01  |  high pot
aa        |  2022-08-01  |  high pot return
bb        |  2022-04-03  |  high pot
bb        |  2022-08-04  |  high pot return

Can anyone help? Thank you.


Solution

  • You can use pivot_longer() from tidyr. You can specify the names_pattern argument to capture everything that comes before the underscore and digit at the end of the name. This argument is a regular expression that identifies the unique names of the pivoted variables. In the example below, I use "(.*)_\\d$". In regular expression terms, . matches any character. The asterisk tells the regular expression to continue including characters until one doesn't match the search string. For example, in the string "aabbc", matching "a" would just identify the first character, but matching "a*" would match the first two characters. Matching "." would again identify just the first character, but matching ".*" would match all 5 characters. In names_pattern, you put the pieces of information you want to extract in parentheses. In the regular expression I used, everything before the underscore and digit at the end (the end of a string is marked with a dollar sign) , is captured, so from the variable names "KE_content_1", "KE_date_2", "KE_content_3", "KE_date_4", the regular expression would return "KE_content", "KE_date", "KE_content", "KE_date". Then, using names_to = ".value" will put each of the unique values returned by the regular expression in names_pattern in a different column. For more on how pivot_longer() works, you could look at the pivoting vignette from the tidyr package. For more information on regular expressions, you could google regex tutorial. This one seems pretty good.

    input <- data.frame(
      Pseudonym = c("aa", "bb"),
      KE_date_1 = c("2022-04-01", "2022-04-03"),
      KE_content_2 = c("high pot", "high pot"),
      KE_date_3 = c("2022-08-01", "2022-08-04"),
      KE_content_4 = c("high pot return", "high pot return")
    )
    
    library(dplyr)
    library(tidyr)
    input %>% 
      pivot_longer(starts_with("KE"), 
                   names_pattern="(.*)_\\d$", 
                   names_to = ".value")
    #> # A tibble: 4 × 3
    #>   Pseudonym KE_date    KE_content     
    #>   <chr>     <chr>      <chr>          
    #> 1 aa        2022-04-01 high pot       
    #> 2 aa        2022-08-01 high pot return
    #> 3 bb        2022-04-03 high pot       
    #> 4 bb        2022-08-04 high pot return
    

    Created on 2023-08-28 with reprex v2.0.2