Search code examples
rdplyrparallel-processingcombinationseconomics

How to find combinations that sum up to a certain value in each row of a dataframe


I've been asked in the job for a function for R that uses parallelization and, when you pass it a dataframe as the one added below, it should look for combinations that added up equals to the value in Importe_Pendiente for each row. The found combinations should be placed in a new column called "Combinations" and its value will be the IDs of the rows that, when summed, result in the same Importe_Pendiente value for that row. The value of Importe_Pendiente could be either a positive or negative. I'm pretty new in this world so I hope you can help me! Thank you very much.

The target dataframe is like this:

datos <- data.frame(
  ID = c("FCR23E015-1625", "E015-23-3583", "E015-23-3584", "E015-23-3585", "FCR23NIEB-0141"),
  Proveedor = c("2192", "6772", "6772", "6772", "7403"),
  Descripcion = c("Factura FCR23E015-1625", "AMAZON BUSINESS EU, S.A.R.L.", "AMAZON BUSINESS EU, S.A.R.L.", "AMAZON BUSINESS EU, S.A.R.L.", "Factura FCR23NIEB-0141"),
  Importe = c(-2330, 54.8, 54.8, 66, -1029),
  Importe_Pendiente = c(-2330, 45, 55, 100, -1029)
)

And the output should be something like this:

 ID             Proveedor Descripcion                  Importe Importe_Pendiente Combinaciones     
  <chr>          <chr>     <chr>                          <dbl>             <dbl> <chr>             
1 FCR23E015-1625 2192      Factura FCR23E015-1625       -2330             -2330 NO_COMBINACIONES   
2 E015-23-3583   6772      AMAZON BUSINESS EU, S.A.R.L.    54.8              45 NO_COMBINACIONES      
3 E015-23-3584   6772      AMAZON BUSINESS EU, S.A.R.L.    54.8              55 NO_COMBINACIONES      
4 E015-23-3585   6772      AMAZON BUSINESS EU, S.A.R.L.    66               100 [E015-23-3583+E015-23-3584]
5 FCR23NIEB-0141 7403      Factura FCR23NIEB-0141       -1028             -1028 NO_COMBINACIONES

In case that some rows have more than 1 combination possible it would be nice that the combinations are split with brackets [].

I've spent a lot of time trying to modify this script which I took from chat-gpt but I wasn't lucky enough to make it work.

plan(multisession)

buscar_combinaciones <- function(df) {
  # Casting the column Importe_Pendiente to numeric type
  df$Importe_Pendiente <- as.numeric(df$Importe_Pendiente)
  
  resultado <- df %>%
    group_by(Proveedor) %>%
    mutate(Combinaciones = {
      if (n() < 2) {
        "NO_COMBINACIONES"
      } else {
        idx_comb <- combn(n(), 2)
        combinations_str <- vector("character", ncol(idx_comb))
        
        for (i in seq_along(combinations_str)) {
          combination <- Importe_Pendiente[idx_comb[, i]]
          if (sum(combination) == 0) {
            combinations_str[i] <- paste(ID[idx_comb[, i]], collapse = "+")
          }
        }
        
        valid_combinations <- na.omit(combinations_str)
        
        if (length(valid_combinations) == 0) {
          "NO_COMBINACIONES"
        } else {
          paste(valid_combinations, collapse = ", ")
        }
      }
    }) %>%
    ungroup()
  
  return(resultado)
}

# Adjusting the number of cores of the CPU to be used while parallelizing
# Only multiple cores will be used when there are more than 1 row for each Proveedor
num_nucleos <- ifelse(length(unique(datos$Proveedor)) > 1, 4, 1)
plan(multisession, workers = num_nucleos)

Solution

  • Probably you can try the code below

    datos %>%
        mutate(Combinationes = {
            if (n() < 2) {
                "NO_COMBINACIONES"
            } else {
                s <- unlist(
                    lapply(
                        2:n(),
                        \(k) {
                            combn(1:n(),
                                k,
                                FUN = \(...) setNames(
                                    sum(Importe_Pendiente[...]),
                                    sprintf("[%s]", paste0(ID[...], collapse = "+"))
                                ),
                                simplify = FALSE
                            )
                        }
                    )
                )
                v <- tapply(names(s), s, toString)
                replace_na(
                    v[match(as.character(Importe_Pendiente), names(v))],
                    "NO_COMBINACIONES"
                )
            }
        }, .by = Proveedor)
    

    which gives

                  ID Proveedor                  Descripcion Importe
    1 FCR23E015-1625      2192       Factura FCR23E015-1625 -2330.0
    2   E015-23-3583      6772 AMAZON BUSINESS EU, S.A.R.L.    54.8
    3   E015-23-3584      6772 AMAZON BUSINESS EU, S.A.R.L.    54.8
    4   E015-23-3585      6772 AMAZON BUSINESS EU, S.A.R.L.    66.0
    5 FCR23NIEB-0141      7403       Factura FCR23NIEB-0141 -1029.0
      Importe_Pendiente               Combinationes
    1             -2330            NO_COMBINACIONES
    2                45            NO_COMBINACIONES
    3                55            NO_COMBINACIONES
    4               100 [E015-23-3583+E015-23-3584]
    5             -1029            NO_COMBINACIONES