Search code examples
rtidyr

How to pivot longer a longitudinal dataset


I have a longitudinal dataset which I created by merging different datasets using the personal identifier column.The dataset columns are in the order personal identifier, a_sex, a_countryofbirth, a_health, a_educationstatus, b_sex, b_countryofbirth, b_health, b_educationstatus, c_sex, c_countryofbirth, c_health, c_educationstatus etc all the way to l. All variables starting with a_ represented the first wave, variables starting with b represented the second wave and so on-

I am trying to create a new variable called Wave using Pivot longer so that my table looks like: -

Table: InPreg_transformed

**

 - Person ID    Wave  Sex CountryofBirth Health

**

I used this code among others but it did work.

InPreg_transformed<- InPregDF %>%

  pivot_longer(cols = contains("_"),

               names_to = c("_value", "Wave"),

               names_pattern = "(_+)"

The other code I used: -

InPreg_transformed<- InPreg %>%

          pivot_longer(cols = contains("."), names_to = c(".value", 
          "Wave"), names_pattern = "(.+).(.+)")

 summary(InPreg_transformed)

Please assist


Solution

  • Just to be sure that I understood correctly, I created a random nonsense example with n "individuals".

    First load the librairies:

    library(tibble)
    library(dplyr)
    library(tidyr)
    

    Then create the dataset:

    sex <- c("Male", "Female")
    europe <- c("Belarus", "Belgium", "Bulgaria",
                "Croatia", "CzechRepublic", "Estonia", "France", 
                "Germany", "Hungary", "Ireland", "Italia", "Latvia", "Lithuania", 
                "Luxembourg", "Netherlands", "Poland", "Portugal", "Romania", 
                "Slovakia", "Slovenia", "Spain")
    health <- c("Excellent", "Good", "Fair", "Poor")
    education <- c("High School", "Bachelor's", "Master's", "PhD")
    n <- 10
    
    wdat <- tibble(
      ID = sprintf("Ind%02i", 1:n), # IDs
      a_sex = sample(sex, n, replace = TRUE),
      a_countryofbirth = sample(europe, n, replace = TRUE),
      a_health = sample(health, n, replace = TRUE),
      a_educationstatus = sample(education, 10, replace = TRUE),
      b_sex = sample(sex, n, replace = TRUE),
      b_countryofbirth = sample(europe, n, replace = TRUE),
      b_health = sample(health, n, replace = TRUE),
      b_educationstatus = sample(education, 10, replace = TRUE),
      c_sex = sample(sex, n, replace = TRUE),
      c_countryofbirth = sample(europe, n, replace = TRUE),
      c_health = sample(health, n, replace = TRUE),
      c_educationstatus = sample(education, 10, replace = TRUE))
    

    The data wdat contains a unique ID for each individual and then three blocks of columns.

    With this data, the syntax to transform it to a "long" format can be done with the pivot_longer function like so

    wdat %>% 
      pivot_longer(
        -ID,
        names_to = c("wave", ".value"),
        names_pattern = "(.)_(.*)"
        ) 
    

    where

    • names_pattern = "(.)_(.*)" means that there are two important pieces of information in the column name, a single character first and a string, separated with _,
    • names_to = c("wave", ".value") means that the single character will go in a column called wave, and the value in the wide columns will go in columns whose names are based on the common pattern, e.g., all the values in a_sex, b_sex and c_sex will go in a column called sex

    EDIT: using names_sep is much easier in this case

    wdat %>% 
      pivot_longer(
        -ID,
        names_sep = "_",
        names_to = c("wave", ".value")
      )