I've been asked in the job for a function for R that uses parallelization and, when you pass it a dataframe as the one added below, it should look for combinations that added up equals to the value in Importe_Pendiente
for each row. The found combinations should be placed in a new column called "Combinations" and its value will be the IDs of the rows that, when summed, result in the same Importe_Pendiente
value for that row. The value of Importe_Pendiente
could be either a positive or negative. I'm pretty new in this world so I hope you can help me! Thank you very much.
The target dataframe is like this:
datos <- data.frame(
ID = c("FCR23E015-1625", "E015-23-3583", "E015-23-3584", "E015-23-3585", "FCR23NIEB-0141"),
Proveedor = c("2192", "6772", "6772", "6772", "7403"),
Descripcion = c("Factura FCR23E015-1625", "AMAZON BUSINESS EU, S.A.R.L.", "AMAZON BUSINESS EU, S.A.R.L.", "AMAZON BUSINESS EU, S.A.R.L.", "Factura FCR23NIEB-0141"),
Importe = c(-2330, 54.8, 54.8, 66, -1029),
Importe_Pendiente = c(-2330, 45, 55, 100, -1029)
)
And the output should be something like this:
ID Proveedor Descripcion Importe Importe_Pendiente Combinaciones
<chr> <chr> <chr> <dbl> <dbl> <chr>
1 FCR23E015-1625 2192 Factura FCR23E015-1625 -2330 -2330 NO_COMBINACIONES
2 E015-23-3583 6772 AMAZON BUSINESS EU, S.A.R.L. 54.8 45 NO_COMBINACIONES
3 E015-23-3584 6772 AMAZON BUSINESS EU, S.A.R.L. 54.8 55 NO_COMBINACIONES
4 E015-23-3585 6772 AMAZON BUSINESS EU, S.A.R.L. 66 100 [E015-23-3583+E015-23-3584]
5 FCR23NIEB-0141 7403 Factura FCR23NIEB-0141 -1028 -1028 NO_COMBINACIONES
In case that some rows have more than 1 combination possible it would be nice that the combinations are split with brackets [].
I've spent a lot of time trying to modify this script which I took from chat-gpt but I wasn't lucky enough to make it work.
plan(multisession)
buscar_combinaciones <- function(df) {
# Casting the column Importe_Pendiente to numeric type
df$Importe_Pendiente <- as.numeric(df$Importe_Pendiente)
resultado <- df %>%
group_by(Proveedor) %>%
mutate(Combinaciones = {
if (n() < 2) {
"NO_COMBINACIONES"
} else {
idx_comb <- combn(n(), 2)
combinations_str <- vector("character", ncol(idx_comb))
for (i in seq_along(combinations_str)) {
combination <- Importe_Pendiente[idx_comb[, i]]
if (sum(combination) == 0) {
combinations_str[i] <- paste(ID[idx_comb[, i]], collapse = "+")
}
}
valid_combinations <- na.omit(combinations_str)
if (length(valid_combinations) == 0) {
"NO_COMBINACIONES"
} else {
paste(valid_combinations, collapse = ", ")
}
}
}) %>%
ungroup()
return(resultado)
}
# Adjusting the number of cores of the CPU to be used while parallelizing
# Only multiple cores will be used when there are more than 1 row for each Proveedor
num_nucleos <- ifelse(length(unique(datos$Proveedor)) > 1, 4, 1)
plan(multisession, workers = num_nucleos)
Probably you can try the code below
datos %>%
mutate(Combinationes = {
if (n() < 2) {
"NO_COMBINACIONES"
} else {
s <- unlist(
lapply(
2:n(),
\(k) {
combn(1:n(),
k,
FUN = \(...) setNames(
sum(Importe_Pendiente[...]),
sprintf("[%s]", paste0(ID[...], collapse = "+"))
),
simplify = FALSE
)
}
)
)
v <- tapply(names(s), s, toString)
replace_na(
v[match(as.character(Importe_Pendiente), names(v))],
"NO_COMBINACIONES"
)
}
}, .by = Proveedor)
which gives
ID Proveedor Descripcion Importe
1 FCR23E015-1625 2192 Factura FCR23E015-1625 -2330.0
2 E015-23-3583 6772 AMAZON BUSINESS EU, S.A.R.L. 54.8
3 E015-23-3584 6772 AMAZON BUSINESS EU, S.A.R.L. 54.8
4 E015-23-3585 6772 AMAZON BUSINESS EU, S.A.R.L. 66.0
5 FCR23NIEB-0141 7403 Factura FCR23NIEB-0141 -1029.0
Importe_Pendiente Combinationes
1 -2330 NO_COMBINACIONES
2 45 NO_COMBINACIONES
3 55 NO_COMBINACIONES
4 100 [E015-23-3583+E015-23-3584]
5 -1029 NO_COMBINACIONES