How to create a column that calculates the difference between two columns for a series of rows that have matching values for two other columns

I have been trying to use dplyr, strigr, and grepl to create a new column that calculates the difference between a column in a dataset that has paired columns with matched values in two column variables and different values for other two columns, one of which is the one of interest in which I would like to calculate the difference.

The matching values for the sets of rows are Date and Time, and the unmatched values for the sets of rows are Context and IndividualID.

The new column that I want to create would apply to the difference between IndividualID in the sets of rows that have the matching values abovementioned (Date, Time).

Here how the dataset looks like,

Date        Time  Context    SeriesNumber IndividualID
01/01/2000  8.00  Mixed       1           a, b, c, d, e, f
01/01/2000  8.00  NON_Mixed   1           d, e, f, g, h
01/01/2000  9.00  Mixed       2           c, d, e, f
01/01/2000  9.00  NON_Mixed   2           a, b, c, d
01/01/2000  9.00  NON_Mixed   2           e, f, g, h

So the new column would be IndividualID_Difference, and would ONLY be calculated for the NON_Mixed contexts, for the mixed it should stay the same.

Date        Time  Context   SeriesNumber  IndividualID       IndividualID_Difference
01/01/2000  8.00  Mixed      1            a, b, c, d, e, f   a, b, c, d, e, f  
01/01/2000  8.00  NON_Mixed  1            d, e, f, g, h      g, h
01/01/2000  9.00  Mixed      2            c, d, e, f         c, d, e, f
01/01/2000  9.00  NON_Mixed  2            a, b, c, d         a, b
01/01/2000  9.00  NON_Mixed  2            e, f, g, h         g, h

Second row, d, e, f are removed because they are present in the matched row (first row with same Time, Date and SeriesNumber, but different Context) that is the Mixed context, they are repeated in the NonMixed, that´s why they need to be removed.

In the fourth row, a, b are removed because they are present in the third row, that is the matched row with same Time, Date and SeriesNumber and different Context.

In the fifth row, e, f are removed because they are present in the third row, that is the matched row with same Time, Date and SeriesNumber and different Context.

The list of IndividualIDs is rather immense.

thanks in case you knew how to come up with a solution for this tricky question!

Solution

I hope that there is a less convoluted way to achieve this but I could could not work out how to do it any other way. As we do not have a dput() of your data, I have assumed your IndividualID column is a character type. Therefore, this method involves converting your IndividualID to a list type column, and creating a list type column of the "mixed" IndividualID for each group. The difference between these two columns is then compared to return the difference.

This approach also assumes the "mixed" value is always the first row of each group. If both of my assumptions are incorrect, let me know and I will update my answer.

library(dplyr)
library(stringr)

# Your sample data
df <- read.table(text = "Date;Time;Context;SeriesNumber;IndividualID
01/01/2000;8.00;Mixed;1;a, b, c, d, e, f
01/01/2000;8.00;NON_Mixed;1;d, e, f, g, h
01/01/2000;9.00;Mixed;2;c, d, e, f
01/01/2000;9.00;NON_Mixed;2;a, b, c, d
01/01/2000;9.00;NON_Mixed;2;e, f, g, h", sep = ";", header = TRUE)

# Add list versions of both IndividualID and the "mixed" variable of 
# IndividualID to new temp columns
df1 <- df %>%
  rowwise() %>%
  mutate(temp1 = list(strsplit(IndividualID,", "))) %>%
  group_by(SeriesNumber) %>%
  mutate(temp2 = list(strsplit(IndividualID[1],", "))) %>%
  ungroup() 

# Function to compare new list columns and return desired results
unmatched_chars <- function(x, y, z) {
  ind_chars <- str_split(x, ", ")[[1]]
  res_chars <- setdiff(ind_chars, unlist(z))
  paste(res_chars, collapse = ", ")
}

# Use unmatched_chars function to create IndividualID_Difference
df1 <- df1 %>%
  rowwise() %>%
  mutate(IndividualID_Difference = unmatched_chars(IndividualID, 
                                                   temp1,
                                                   temp2),
         IndividualID_Difference = ifelse(IndividualID_Difference == "", 
                                          IndividualID, 
                                          IndividualID_Difference)) %>%
  select(-starts_with("temp"))

data.frame(df1)
        Date Time   Context SeriesNumber     IndividualID IndividualID_Difference
1 01/01/2000    8     Mixed            1 a, b, c, d, e, f        a, b, c, d, e, f
2 01/01/2000    8 NON_Mixed            1    d, e, f, g, h                    g, h
3 01/01/2000    9     Mixed            2       c, d, e, f              c, d, e, f
4 01/01/2000    9 NON_Mixed            2       a, b, c, d                    a, b
5 01/01/2000    9 NON_Mixed            2       e, f, g, h                    g, h