Search code examples
rgroup-bycharactersubset

Combine character variable over rows and columns by group in R


I am a beginner in R and I am trying to solve a problem in R, which is I guess quite easy for experienced users.

The problem is the following: Customers (A, B, C) are coming in repeatedly using different programms (Prg). I would like to identify "typical sequences" of programs. Therefore, I identify the first programm, they consume, the second, and the third. In a next step, I would like to combine these information to sequences of programms by customer. For a customer first consuming Prg1, then Prg2, then Prg3, the final outcome should be "Prg1-Prg2-Prg3".

The code below produces a dataframe similar to the one I have. Prg is the Programm in the respective year, First is the first year the customer enters, Sec the second and Third the third.

The code produces columns that extract the program consumed in the first contract (Code_1_Prg), second contract (Code_2_Prg) and third contract (Code_3_Prg).

Unfortunately, I am not successful combining these 3 columns to the required goal. I tried to group by ID and save the frist element of the sequence in a new column called "chain1". Here I get the error message "Error in df %>% group_by(ID) %>% df$chain1 = df[df$Code_1_Prg != "NA", : could not find function "%>%<-", even though I am using the magrittr and dplyr packages.

detach(package:plyr)
library(dplyr)
library(magrittr)
df %>%
  group_by(ID)  %>%
  df$chain1 = df[df$Code_1_Prg!="NA", "Code_1_Prg"]

Below, I share some code, which produces the dataframe and the starting point for extracting the character variable in Code_1_Prg by group.

I would be really grateful, if you could help me with this. Thank you very much in advance!

df <- data.frame("ID"=c("A","A","A","A","B", "B", "B","B","B","C","C", "C", "C","C","C","C"),
                 "Year_Contract" =c("2010", "2015",  "2017","2017","2010","2010", "2015","2015","2020","2015","2015","2017","2017","2017","2018","2018"),
                 "Prg"=c("AIB","AIB","LLA","LLA","BBU","BBU", "KLU","KLU","DDI","CKN","CKN","BBU","BBU","BBU","KLU","KLU"),
                 "First"=c("2010","2010","2010","2010","2010","2010", "2010","2010","2010","2015","2015","2015","2015","2015","2015","2015"),
                  "Sec"=c("2015","2015","2015","2015","2015","2015", "2015","2015","2015","2017","2017","2017","2017","2017","2017","2017"),
                  "Third"=c("2017","2017","2017","2017","2020","2020", "2020","2020","2020","2018","2018","2018","2018","2018","2018","2018")
                 )


df$Code_1_Prg <- ifelse(df$Year_Contract == df$First, df$Code_1_Prg <- df$Prg, NA)
df$Code_2_Prg <- ifelse(df$Year_Contract == df$Sec, df$Code_2_Prg <- df$Prg, NA)
df$Code_3_Prg <- ifelse(df$Year_Contract == df$Third, df$Code_3_Prg <- df$Prg, NA)


detach(package:plyr)
library(dplyr)
library(magrittr)
df %>%
  group_by(ID)  %>%
  df$chain1 = df[df$Code_1_Prg!="NA", "Code_1_Prg"]



#This is the final column, I am trying to create
df2 <- data.frame("ID"=c("A","B", "C"),
              "Goal" =c("AIB-LLA", "BBU-KLU-DDI",  "CKN-BBU-KLU")
                                  
                  )
df <- merge(df, df2, by="ID")

Solution

  • Are you looking for something like this?

    libra4ry(dplyr)
    df %>% 
      group_by(ID) %>% 
      arrange(Year_Contract, .by_group = TRUE) %>% 
      distinct() %>% 
      summarise(sequence = toString(Prg))
    
      ID    sequence     
      <chr> <chr>        
    1 A     AIB, AIB, LLA
    2 B     BBU, KLU, DDI
    3 C     CKN, BBU, KLU