Search code examples
rtidyversepivot-tabletidyrtabular

How to transform and pivot a rather messy table in R using tidyverse?


I have a table with approximately 2000 entries that contains names, positions, field of expertise, and addresses of professors. The table is quite messy, and I’m struggling to find a programmatic way to transform and pivot it into a tidy format.

My goal is to create a tidy table that includes at least the following columns: names, positions, field of expertise, and email address.

(one recurring pattern throughout the data is that each person is separated by a NA from the next one.)


Here is a sample of the data:

data field
Person 1, MD A
Associate Professor of A A
Program Associate, A A
Senior Medical Director, FGP Operations A
UMMG Ambulatory Surgery A
Residency Program A
Core Educational Lead A
[email protected] A
NA A
Person 2, MD B
Person 2 B
Clinical Assistant Professor of B B
Medical Student Clerkship and Core Educational Lead B
[email protected] B
NA B
labcoat B
Person 3, MD B
Clinical Assistant Professor of B B
[email protected] B
NA B
labcoat B
Person 4, MD B
Professor of B B
Professor of B B
[email protected] B
NA B
Person 5, MD C
Person 5 C
Professor of C C
Professor of C C
Associate Chair, Quality C
Department of Urology and Service Chief C
[email protected] C
132-547-1321 C
NA C

Here is the tibble code (to reproduce):

tibble::tribble(
                                                  ~data, ~field,
                                         "Person 1, MD",    "A",
                             "Associate Professor of A",    "A",
                                 "Program Associate, A",    "A",
              "Senior Medical Director, FGP Operations",    "A",
                              "UMMG Ambulatory Surgery",    "A",
                                    "Residency Program",    "A",
                                "Core Educational Lead",    "A",
                                     "[email protected]",    "A",
                                                     NA,    "A",
                                         "Person 2, MD",    "B",
                                             "Person 2",    "B",
                    "Clinical Assistant Professor of B",    "B",
  "Medical Student Clerkship and Core Educational Lead",    "B",
                                     "[email protected]",    "B",
                                                     NA,    "B",
                                              "labcoat",    "B",
                                         "Person 3, MD",    "B",
                    "Clinical Assistant Professor of B",    "B",
                                     "[email protected]",    "B",
                                                     NA,    "B",
                                              "labcoat",    "B",
                                         "Person 4, MD",    "B",
                                       "Professor of B",    "B",
                                       "Professor of B",    "B",
                                     "[email protected]",    "B",
                                                     NA,    "B",
                                         "Person 5, MD",    "C",
                                             "Person 5",    "C",
                                       "Professor of C",    "C",
                                       "Professor of C",    "C",
                             "Associate Chair, Quality",    "C",
              "Department of Urology and Service Chief",    "C",
                                     "[email protected]",    "C",
                                         "132-547-1321",    "C",
                                                     NA,    "C"
  )

Solution

  • Answered before your comment regarding each person being separated by NA so it does not retain any rows after an email address (but you implied this was not important in your question anyway). Here's a stepwise approach. Highlight up to each pipe "|>" and run to see how each step works. Comment below if you need further explanation.

    library(dplyr)
    library(tidyr)
    library(stringr)
    
    df_wide <- df |>
      mutate(names = case_when(str_detect(data, ", MD$") ~ "name",
                               str_detect(data, "@") ~ "email",
                               .default = NA),
             data = str_remove(data, ", MD$"),
             tmp_start = cumsum(!is.na(names) & names == "name"),
             tmp_end = lag(cumsum(!is.na(names) & names == "email"), default = 0),
             id = if_else(tmp_start == tmp_end, NA, tmp_start)) |>
      group_by(id) |>
      fill(id, .direction = "downup") |>
      select(-starts_with("tmp")) |>
      filter(!is.na(id)) |>
      filter(!duplicated(data, fromFirst = TRUE)) |>
      mutate(names = if_else(is.na(names), paste0("position", 1:n() - 1), names)) |>
      pivot_wider(names_from = names,
                  values_from = data) |>
      ungroup() |>
      select(-id)
    
    data.frame(df_wide)
    # field     name                         position1                                           position2                               position3               position4         position5             position6            email
    #     A Person 1          Associate Professor of A                                Program Associate, A Senior Medical Director, FGP Operations UMMG Ambulatory Surgery Residency Program Core Educational Lead [email protected]
    #     B Person 2 Clinical Assistant Professor of B Medical Student Clerkship and Core Educational Lead                                    <NA>                    <NA>              <NA>                  <NA> [email protected]
    #     B Person 3 Clinical Assistant Professor of B                                                <NA>                                    <NA>                    <NA>              <NA>                  <NA> [email protected]
    #     B Person 4                    Professor of B                                                <NA>                                    <NA>                    <NA>              <NA>                  <NA> [email protected]
    #     C Person 5                    Professor of C                            Associate Chair, Quality Department of Urology and Service Chief                    <NA>              <NA>                  <NA> [email protected]