Search code examples
rdataframedplyrplyr

Aggregate and collapse a vector based while maintaing order


I've a dataframe as under:

+------+-----+----------+
| from | to  | priority |
+------+-----+----------+
|    1 |   8 |        1 |
|    2 |   6 |        1 |
|    3 |   4 |        1 |
|    4 |   5 |        3 |
|    5 |   6 |        4 |
|    6 |   2 |        5 |
|    7 |   8 |        2 |
|    4 |   3 |        5 |
|    2 |   1 |        1 |
|    6 |   6 |        4 |
|    1 |   7 |        5 |
|    8 |   4 |        6 |
|    9 |   5 |        3 |
+------+-----+----------+

my goal is to group the "to" column based on the from column but in such a way that if a variable is already present in either of the columns, I wouldn't want to take them into consideration further Also, the total priority would be a sum of all the group priorities

So the resulting dataframe would be as under:

+------+------+----------------+
| from |  to  | Total Priority |
+------+------+----------------+
|    1 | 8, 7 |              6 |
|    2 |    6 |              1 |
|    3 |    4 |              1 |
|    9 |    5 |              3 |
+------+------+----------------+

Also, I'd want to keep the same order as of the original table while grouping

I was able to collapse the from column using the "splitstackshape" package as under

library(splitstackshape)
cSplit(df, 'to', sep = ','
+        , direction = 'long')[, .(to = toString(unique(to)))
+                              , by = from]

This does introduce dupicate values I was wondering if there's a way to get the desired result using any other packages


Solution

  • Using DF shown reproducibly in the Note at the end, sort by from giving DF2 and then iterate through its rows removing any row with a duplicate. We need a loop here since each removal depends on the prior ones. Finally summarize the result.

    library(dplyr)
    
    DF2 <- arrange(DF, from)
    
    i <- 1
    while(i <= nrow(DF2)) {
      ix <- seq_len(i-1)
      dup <- with(DF2, (to[i] %in% c(to[ix], from[ix])) | (from[i] %in% to[ix]))
      if (dup) DF2 <- DF2[-i, ] else i <- i + 1
    }
    
    DF2 %>%
      group_by(from) %>%
      summarize(to = toString(to), priority = sum(priority)) %>%
      ungroup
    

    giving:

    # A tibble: 4 x 3
       from to    priority
      <int> <chr>    <int>
    1     1 8, 7         6
    2     2 6            1
    3     3 4            1
    4     9 5            3
    

    Note

    Lines <- "from | to  | priority
       1 |   8 |        1
       2 |   6 |        1
       3 |   4 |        1
       4 |   5 |        3
       5 |   6 |        4
       6 |   2 |        5
       7 |   8 |        2
       4 |   3 |        5
       2 |   1 |        1
       6 |   6 |        4
       1 |   7 |        5
       8 |   4 |        6
       9 |   5 |        3"
    DF <- read.table(text = Lines, header = TRUE, sep = "|", strip.white = TRUE)