Finding unique tuples in R but ignoring order


Since my data is much more complicated, I made a smaller sample dataset (I left the reshape in to show how I generated the data).

set.seed(7)
x = rep(seq(2010,2014,1), each=4)
y = rep(seq(1,4,1), 5)
z = matrix(replicate(5, sample(c("A", "B", "C", "D"))))
temp_df = cbind.data.frame(x,y,z)
colnames(temp_df) = c("Year", "Rank", "ID")
head(temp_df)
require(reshape2)
dcast(temp_df, Year ~ Rank)

which results in...

> dcast(temp_df, Year ~ Rank)
Using ID as value column: use value.var to override.
  Year 1 2 3 4
1 2010 D B A C
2 2011 A C D B
3 2012 A B D C
4 2013 D A C B
5 2014 C A B D

Now I essentially want to use a function like unique, but ignoring order to find where the first 3 elements are unique.

Thus in this case:

I would have A,B,C in row 5

I would have A,B,D in rows 1&3

I would have A,C,D in rows 2&4

Also I need counts of these "unique" events

Two more things. First, my values are strings, and I need to leave them as strings. Second, if possible, I would like a column between Year and 1 called Weighting, and when counting these unique combinations I would add up each row's weighting. This is less important: all weightings are small positive integers, so I could duplicate the rows beforehand to account for the weighting and then tabulate the unique combinations.


Solution

  • You could do something like this:

    library(reshape2)

    # value.var is given explicitly to avoid the "Using ID as value column" message
    df <- dcast(temp_df, Year ~ Rank, value.var = "ID")

    # Columns 2:4 of df hold ranks 1 to 3; sorting each row makes order irrelevant
    combos <- apply(df[, 2:4], 1, function(x) paste0(sort(x), collapse = ""))

    combos
    #     1     2     3     4     5 
    # "ABD" "ACD" "ABD" "ACD" "ABC" 

    For each row of the data frame, the values in the columns labeled 1, 2, and 3 in the post (positions 2:4 of df, since Year is the first column) are sorted using sort, then concatenated using paste0. Since order doesn't matter, sorting first ensures that identical cases get the same label. The output shown is for the table in the question.

    Note that paste0(...) is equivalent to paste(..., sep = ""). The collapse argument joins the elements of a vector into a single string, separated by the value passed to collapse. Here collapse = "" means there is no separator between values, resulting in "ABC", "ACD", etc.

    Then you can get the count of each combination using table:

    table(combos)
    # ABC ABD ACD 
    #   1   2   2 
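
The question also asks which rows share a combination and how to count with a per-row weight, which the answer above does not cover. Below is a minimal sketch of one way to do it, not the answerer's method. It rebuilds the question's wide table by hand, and the Weighting values are made up for illustration (the post gives none). The names combo, years_by_combo and weighted_counts are likewise hypothetical.

```r
# Rebuild the wide table shown in the question by hand (no reshape2 needed).
df <- data.frame(
  Year = 2010:2014,
  Weighting = c(2L, 1L, 3L, 1L, 2L),  # hypothetical weights, not from the post
  `1` = c("D", "A", "A", "D", "C"),
  `2` = c("B", "C", "B", "A", "A"),
  `3` = c("A", "D", "D", "C", "B"),
  `4` = c("C", "B", "C", "B", "D"),
  check.names = FALSE,      # keep the column names "1".."4" as they are
  stringsAsFactors = FALSE  # values stay character strings
)

# Order-independent key built from ranks 1 to 3. The columns are selected by
# name so the extra Weighting column does not shift their positions.
df$combo <- apply(df[, c("1", "2", "3")], 1,
                  function(x) paste(sort(x), collapse = ""))

# Years (rows) that share each combination.
years_by_combo <- split(df$Year, df$combo)

# Weighted counts: sum the weights within each combination.
weighted_counts <- tapply(df$Weighting, df$combo, sum)

print(years_by_combo)
print(weighted_counts)
```

With these made-up weights, ABD covers 2010 and 2012 and its weighted count is 2 + 3 = 5. Summing the weights gives the same totals as duplicating rows and then calling table, without enlarging the data.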