There is a dataset like this
df <- data.frame(
Pseudonym = c("aa", "bb"),
KE_date_1 = c("2022-04-01", "2022-04-03"),
KE_content_2 = c("high pot", "high pot"),
KE_date_3 = c("2022-08-01", "2022-08-04"),
KE_content_4 = c("high pot return", "high pot return")
)
Pseudonym | KE_date_1 | KE_content_2| KE_date_3 | KE_content_4
--------------------------------------------------------------------
aa | 2022-04-01| high pot | 2022-08-01 | high pot return
bb | 2022-04-03| high pot | 2022-08-04 | high pot return
A column with an identifier (distinct pseudonym) and pairwise data columns, date + text. The data columns have an increasinig suffix. ..1 and ..2 belong togethter, etc.
From this a transformed long dataset is needed, where the pseudonym is not longer distinct.
df2 <- data.frame(
Pseudonym = c("aa", "aa", "bb", "bb"),
KE_date = c("2022-04-01", "2022-08-01", "2022-04-03", "2022-08-04"),
KE_content = c("high pot", "high pot return", "high pot", "high pot return")
)
Pseudonym | KE_date | KE_content
-------------------------------------------
aa | 2022-04-01 | high pot
aa | 2022-08-01 | high pot return
bb | 2022-04-03 | high pot
bb | 2022-08-04 | high pot return
Can anyone help? Thank you.
You can use pivot_longer()
from tidyr
. You can specify the names_pattern
argument to capture everything that comes before the underscore and digit at the end of the name. This argument is a regular expression that identifies the unique names of the pivoted variables. In the example below, I use "(.*)_\\d$"
. In regular expression terms, .
matches any character. The asterisk tells the regular expression to continue including characters until one doesn't match the search string. For example, in the string "aabbc"
, matching "a" would just identify the first character, but matching "a*" would match the first two characters. Matching "." would again identify just the first character, but matching ".*"
would match all 5 characters. In names_pattern
, you put the pieces of information you want to extract in parentheses. In the regular expression I used, everything before the underscore and digit at the end (the end of a string is marked with a dollar sign) , is captured, so from the variable names "KE_content_1", "KE_date_2", "KE_content_3", "KE_date_4"
, the regular expression would return "KE_content", "KE_date", "KE_content", "KE_date"
. Then, using names_to = ".value"
will put each of the unique values returned by the regular expression in names_pattern
in a different column. For more on how pivot_longer()
works, you could look at the pivoting vignette from the tidyr
package. For more information on regular expressions, you could google regex tutorial. This one seems pretty good.
input <- data.frame(
Pseudonym = c("aa", "bb"),
KE_date_1 = c("2022-04-01", "2022-04-03"),
KE_content_2 = c("high pot", "high pot"),
KE_date_3 = c("2022-08-01", "2022-08-04"),
KE_content_4 = c("high pot return", "high pot return")
)
library(dplyr)
library(tidyr)
input %>%
pivot_longer(starts_with("KE"),
names_pattern="(.*)_\\d$",
names_to = ".value")
#> # A tibble: 4 × 3
#> Pseudonym KE_date KE_content
#> <chr> <chr> <chr>
#> 1 aa 2022-04-01 high pot
#> 2 aa 2022-08-01 high pot return
#> 3 bb 2022-04-03 high pot
#> 4 bb 2022-08-04 high pot return
Created on 2023-08-28 with reprex v2.0.2