Search code examples
rdplyrtidyrdata-cleaningspread

R extremely long to wide with linked rows


I have a df similar to the following:

df <- data.frame(name = c("billy","billy","sarah","sarah","sarah","sarah","sarah","sarah","sarah","linda","linda","linda","linda","linda","linda"),
                 data = c("soccer","8-01-1992","basketball","soccer","10-19-1995","10-25-1995","basketball","11-24-1995",6,"hockey","soccer","basketball","12-21-2001","12-30-2001","1-19-2002"),
                 event = c("joins.project","joins.when","joins.project","joins.project1","joins.when","joins.when1","participation.project","participation.when","participation.repetitions","joins.project","joins.project1","joins.project2","joins.when","joins.when1","joins.when2"))

> df
    name       data                     event
1  billy     soccer             joins.project
2  billy  8-01-1992                joins.when
3  sarah basketball             joins.project
4  sarah     soccer            joins.project1
5  sarah 10-19-1995                joins.when
6  sarah 10-25-1995               joins.when1
7  sarah basketball     participation.project
8  sarah 11-24-1995        participation.when
9  sarah          6 participation.repetitions
10 linda     hockey             joins.project
11 linda     soccer            joins.project1
12 linda basketball            joins.project2
13 linda 12-21-2001                joins.when
14 linda 12-30-2001               joins.when1
15 linda  1-19-2002               joins.when2

My desired output is below.

new_df <- data.frame(name = c("billy", "sarah", "sarah", "sarah", "linda", "linda", "linda"),
                     join_or_particip = c("join", "join", "join", "participate", "join", "join", "join"),
                     sport = c("soccer", "basketball", "soccer", "basketball", "hockey", "soccer", "basketball"),
                     when = c("8-01-1992", "10-19-1995", "10-25-1995", "11-24-1995", "12-21-2001", "12-30-2001", "1-19-2002"),
                     repetitions = c(NA, NA, NA, 6, NA, NA, NA))
new_df
   name join_or_particip      sport       when repetitions
1 billy             join     soccer  8-01-1992          NA
2 sarah             join basketball 10-19-1995          NA
3 sarah             join     soccer 10-25-1995          NA
4 sarah      participate basketball 11-24-1995           6
5 linda             join     hockey 12-21-2001          NA
6 linda             join     soccer 12-30-2001          NA
7 linda             join basketball  1-19-2002          NA

I imagine I need to use spread() to achieve this but I'm not sure how to account for the fact that certain rows in df are linked (i.e., "joins.project1" and "joins.when1" go together). A bit of further context: this is a dataset of people who "joined" or "participated" in between 1 and 200+ "sports" on different dates. If they "participated" there is an additional variable called "repetitions" (i.e., how many times they participated).


Solution

  • dplyr way

    library(dplyr)
    library(tidyr)
    library(stringr)
    
    df %>%
      separate(event, into = c("join_or_particip", "type"), sep = "\\.") %>%
      mutate(num = as.numeric(gsub('[^0-9.-]', '', type)) %>% replace_na(., 0),
             type = gsub('[0-9]+', '', type)) %>%
      pivot_wider(id_cols = c(name, join_or_particip, num), names_from = type, values_from = data) %>%
      select(-num)
    
    
      name  join_or_particip project    when       repetitions
      <chr> <chr>            <chr>      <chr>      <chr>      
    1 billy joins            soccer     8-01-1992  NA         
    2 sarah joins            basketball 10-19-1995 NA         
    3 sarah joins            soccer     10-25-1995 NA         
    4 sarah participation    basketball 11-24-1995 6          
    5 linda joins            hockey     12-21-2001 NA         
    6 linda joins            soccer     12-30-2001 NA         
    7 linda joins            basketball 1-19-2002  NA