I have a matrix like so:
mat <- matrix(c(1,0,0,0,0,0,1,0,0,0,0,0,0,0,2,0,
2,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,
0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,
0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,1,0,0,1,0,1,1,0,0,1,0,1,
1,1,0,0,0,0,0,0,1,0,1,2,1,0,0,0), nrow=16, ncol=6)
dimnames(mat) <- list(c("a", "c", "f", "h", "i", "j", "l", "m",
"p", "q", "s", "t", "u", "v","x", "z"),
c("1", "2", "3", "4", "5", "6"))
I want to group (bin) columns and then aggregate the data in each bin. For a bin of size x, the sampling is repeated n times; the whole process is then repeated for a bin size of x+1.
For the first iteration, two random columns are binned. I would like to sample without replacement such that a combination of two columns is not sampled twice (however a column can be used twice if it is paired with a different column). Aggregation is defined as calculating row sums for the binned columns. Aggregated results will be added as a new column in a result matrix for that bin size. The number of columns in the result matrix will be limited to the number of bins randomly sampled.
The bin size then grows by one at each iteration. For the next iteration, the bin size increases to 3, so that 3 randomly selected columns are aggregated. These aggregated data will be put into a different result matrix. This process would continue until the bin size equals the number of columns in the matrix, at which point resampling is impossible. All result matrices would be put into a list of matrices.
Below is the expected result, resultList, for the first two bin sizes given the matrix above.
# Bin size = 2
# The randomly sampled columns are columns 1&2, 2&3, 3&4, 4&5, 5&6.
mat1 <- matrix(c(3,0,0,0,1,0,1,0,0,0,0,0,0,0,3,0,
2,0,1,1,2,0,0,0,0,0,0,0,0,0,1,0,
0,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,
0,1,0,0,1,1,0,1,0,1,1,0,0,1,0,1,
1,1,0,0,1,0,0,1,1,1,2,2,1,1,0,1), nrow=16)
dimnames(mat1) <- list(c("a", "c", "f", "h", "i", "j", "l", "m",
"p", "q", "s", "t", "u", "v","x", "z"),
c("1_2", "2_3", "3_4", "4_5", "5_6"))
# Bin size = 3
# The randomly selected columns to be joined are columns 1,2&3,
# 2,3&4, 3,4&5, 4,5&6.
mat2 <- matrix(c(3,0,1,1,2,0,1,0,0,0,0,0,0,0,3,0,
2,1,1,1,2,1,0,0,0,0,0,0,0,0,1,0,
0,1,1,1,2,1,0,1,0,1,1,0,0,1,0,1,
1,2,0,0,1,1,0,1,1,1,2,2,1,1,0,1), nrow=16)
dimnames(mat2) <- list(c("a", "c", "f", "h", "i", "j", "l", "m",
"p", "q", "s", "t", "u", "v","x", "z"),
c("1_2_3", "2_3_4", "3_4_5", "4_5_6"))
resultList <- list(mat1, mat2)
I have posted a similar question for an alternative binning technique here: Bin columns and aggregate data via random sample with replacement for iteratively larger bin sizes
Here is my attempt at binning randomly selected columns and putting the results for each bin size into a list of matrices. I attempted to select j random columns using sample, do rowSums, and remove those selected j paired columns so that they are not repeated in the next iteration:
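# For each j, sum windows of j + 1 adjacent columns starting at column i
# (size is left unspecified, as in the original attempt)
lapply(seq_len(ncol(mat) - 1), function(j)
  do.call(cbind,
          lapply(sample(ncol(mat) - j, size = ), function(i)
            rowSums(mat[, i:(i + j)]))))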
Depending on how many columns you want in your final output, we can modify the approach, but currently this gives all possible combinations.
# Get column names of the matrix
all_cols <- colnames(mat)
# Select bin value from 2:ncol(mat)
total_out <- lapply(seq_len(ncol(mat))[-1], function(j) {
  # Create all combinations taking j items at a time
  temp <- combn(all_cols, j, function(x)
    # Take rowSums for the current combination
    # Also paste column names to assign column names later
    list(rowSums(mat[, x]), paste0(x, collapse = "_")), simplify = FALSE)
  # Combine rowSums matrix
  new_mat <- sapply(temp, `[[`, 1)
  # Assign column names
  colnames(new_mat) <- sapply(temp, `[[`, 2)
  # Return new matrix
  new_mat
})
The current output looks like
total_out
#[[1]]
# 1_2 1_3 1_4 1_5 1_6 2_3 2_4 2_5 2_6 3_4 3_5 3_6 4_5 4_6 5_6
#a 3 1 1 1 2 2 2 2 3 0 0 1 0 1 1
#c 0 0 1 0 1 0 1 0 1 1 0 1 1 2 1
#f 0 1 0 0 0 1 0 0 0 1 1 1 0 0 0
#h 0 1 0 0 0 1 0 0 0 1 1 1 0 0 0
#i 1 1 0 1 0 2 1 2 1 1 2 1 1 0 1
#j 0 0 1 0 0 0 1 0 0 1 0 0 1 1 0
#l 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
#m 0 0 0 1 0 0 0 1 0 0 1 0 1 0 1
#p 0 0 0 0 1 0 0 0 1 0 0 1 0 1 1
#q 0 0 0 1 0 0 0 1 0 0 1 0 1 0 1
#s 0 0 0 1 1 0 0 1 1 0 1 1 1 1 2
#t 0 0 0 0 2 0 0 0 2 0 0 2 0 2 2
#u 0 0 0 0 1 0 0 0 1 0 0 1 0 1 1
#v 0 0 0 1 0 0 0 1 0 0 1 0 1 0 1
#x 3 2 2 2 2 1 1 1 1 0 0 0 0 0 0
#z 0 0 0 1 0 0 0 1 0 0 1 0 1 0 1
#...
#....
#....
#[[5]]
# 1_2_3_4_5_6
#a 4
#c 2
#f 1
#h 1
#i 3
#j 1
#l 1
#m 1
#p 1
#q 1
#s 2
#t 2
#u 1
#v 1
#x 3
#z 1
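To make the shape of total_out easier to follow, here is a minimal sketch of the same combn pattern on a toy matrix. The names toy, pairs and res are illustrative assumptions and not part of the code above.

```r
# Toy 2 x 3 matrix (hypothetical data, for illustration only)
toy <- matrix(1:6, nrow = 2,
              dimnames = list(c("r1", "r2"), c("A", "B", "C")))

# One list element per pair of columns: the row sums and a joined label
pairs <- combn(colnames(toy), 2, function(x)
  list(rowSums(toy[, x]), paste0(x, collapse = "_")), simplify = FALSE)

# Bind the row sums into a matrix and label its columns
res <- sapply(pairs, `[[`, 1)
colnames(res) <- sapply(pairs, `[[`, 2)
res
```

Each column of res is the row sum of one pair of columns (A_B, A_C, B_C), which is what every element of total_out holds for its own bin size.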
Note that there are 5 (ncol - 1) matrices in total_out, with the following numbers of columns:
length(total_out)
#[1] 5
sapply(total_out, ncol)
#[1] 15 20 15 6 1
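These column counts are simply the numbers of combinations for each bin size, so they can be checked without building any matrices. A small sketch, where n_cols and counts are illustrative names:

```r
# Number of ways to choose k of 6 columns, for k = 2, ..., 6
n_cols <- 6
counts <- choose(n_cols, 2:n_cols)
counts
#[1] 15 20 15  6  1
```

This is also why enumerating every combination becomes expensive for wide matrices: the counts grow quickly with the number of columns.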
Since we know that the last element in the list is always a one-column matrix, we can remove it and select a random ceiling(nc/2) columns from each remaining matrix, where nc is that matrix's number of columns.
total_out <- total_out[-length(total_out)]
lapply(total_out, function(x) {
  nc <- ncol(x)
  # Keep a random half (rounded up) of the columns, sampled without replacement
  x[, sample(nc, ceiling(nc/2))]
})
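If you would rather fix the number of sampled bins per bin size (the n in the question) instead of taking half, one possible variant is to sample the combinations first and compute row sums only for the ones drawn. This is a sketch under assumptions, not the approach above: sample_bins, its arguments n and sizes, and the toy matrix are all hypothetical names.

```r
# Hypothetical helper (not from the answer above): for each bin size k,
# sample up to n distinct column combinations and sum their rows.
sample_bins <- function(m, n, sizes = 2:(ncol(m) - 1)) {
  lapply(sizes, function(k) {
    combos <- combn(ncol(m), k)           # each column is one combination
    n_take <- min(n, ncol(combos))        # cannot draw more than exist
    # sample.int draws without replacement, so no combination repeats
    picked <- combos[, sample.int(ncol(combos), n_take), drop = FALSE]
    out <- vapply(seq_len(n_take), function(i)
      rowSums(m[, picked[, i], drop = FALSE]), numeric(nrow(m)))
    dimnames(out) <- list(rownames(m),
                          apply(picked, 2, function(idx)
                            paste(colnames(m)[idx], collapse = "_")))
    out
  })
}

set.seed(42)  # arbitrary seed, only for reproducibility
toy <- matrix(1:12, nrow = 3,
              dimnames = list(c("a", "b", "c"), c("1", "2", "3", "4")))
res <- sample_bins(toy, n = 3)
sapply(res, ncol)
```

With this toy input each result matrix has three columns, one per sampled combination. A column can still appear in several bins, but the same combination is never drawn twice for a given bin size.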