
Count combinations of categorical variables, regardless of order, in R?


Thanks for any help! I have a dataframe in R with two columns of categorical variables, like so:

rowA <- c("Square", "Circle", "Triangle", "Square", "Circle", "Triangle", "Square", "Circle", "Triangle")

rowB <- c("Circle", "Square", "Square", "Square", "Circle", "Circle", "Triangle", "Triangle", "Triangle")

df1 <- data.frame(rowA, rowB)

print(df1)

When we print it, it looks like this:

      rowA     rowB
1   Square   Circle
2   Circle   Square
3 Triangle   Square
4   Square   Square
5   Circle   Circle
6 Triangle   Circle
7   Square Triangle
8   Circle Triangle
9 Triangle Triangle

I want to count the frequency of each combination of categories in rowA and rowB. Here's what I'm hung up on -- the combinations are reversible, meaning "Square - Circle" is the same as "Circle - Square" for our purposes, and we want them to be summed together. The ideal output would look like this:

Pair             Count
Square - Circle      2
Square - Triangle    2
Square - Square      1
Circle - Triangle    2
Circle - Circle      1
Triangle - Triangle  1

I'd be thrilled if anybody had any advice, thanks!

Edit: Post got flagged as a duplicate question, but I don't agree that the suggested posts adequately answered my question (hence I asked in the first place, after a lot of digging). Really appreciate the unique and easy answers here.


Solution

  • We could rearrange the values within each row with pmin/pmax, so that every pair appears in the same order, and then count

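    library(dplyr)
    library(stringr)

    # pmin/pmax put each pair in alphabetical order, so "Square - Circle"
    # and "Circle - Square" both become "Circle - Square" before counting
    df1 %>%
      count(Pair = str_c(pmin(rowA, rowB), ' - ', pmax(rowA, rowB)),
            name = "Count")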
    

    -output

                     Pair Count
    1     Circle - Circle     1
    2     Circle - Square     2
    3   Circle - Triangle     2
    4     Square - Square     1
    5   Square - Triangle     2
    6 Triangle - Triangle     1
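
If you would rather avoid the extra packages, the same pmin/pmax idea can be sketched in base R. This is only a rough alternative, not part of the original answer: the intermediate names a, b, pair and counts are made up for illustration, and the as.character() calls are an assumption added in case the columns are stored as factors (for example in R versions before 4.0, where data.frame() converted strings to factors by default).

```r
# Same example data as in the question
rowA <- c("Square", "Circle", "Triangle", "Square", "Circle", "Triangle",
          "Square", "Circle", "Triangle")
rowB <- c("Circle", "Square", "Square", "Square", "Circle", "Circle",
          "Triangle", "Triangle", "Triangle")
df1 <- data.frame(rowA, rowB)

# Work on plain character vectors (assumption: columns might be factors)
a <- as.character(df1$rowA)
b <- as.character(df1$rowB)

# Put each pair in a consistent alphabetical order and build one label per row
pair <- paste(pmin(a, b), pmax(a, b), sep = " - ")

# Tabulate the labels and turn the table into a two-column data frame
counts <- as.data.frame(table(Pair = pair), responseName = "Count")
print(counts)
```

This should give the same six pairs and counts as the dplyr version above. Note that in both versions the pair labels come out in alphabetical order within each pair ("Circle - Square" rather than "Square - Circle"), since that ordering is what makes the two directions collapse into one.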