
Randomizing balanced experimental designs


I am writing some code to generate balanced experimental designs for market research, specifically for use in conjoint analysis and maximum difference scaling.

The first step is to generate a Partially Balanced Incomplete Block (PBIB) design. This is straightforward using the R package AlgDesign.

For most types of research such a design would be sufficient. However, in market research one wants to control for order effects in each block. This is where I would appreciate some help.

Create test data

# The following code is not essential in understanding the problem, 
# but I provide it in case you are curious about the origin of the data itself.
#library(AlgDesign)
#set.seed(12345)
#choices <- 4
#nAttributes <- 7
#blocksize <- 7
#bsize <- rep(choices, blocksize)
#PBIB <- optBlock(~., withinData=factor(1:nAttributes), blocksizes=bsize)
#df <- data.frame(t(array(PBIB$rows, dim=c(choices, blocksize))))
#colnames(df) <- paste("Item", 1:choices, sep="")
#rownames(df) <- paste("Set", 1:nAttributes, sep="")

df <- structure(list(
  Item1 = c(1, 2, 1, 3, 1, 1, 2), 
  Item2 = c(4, 4, 2, 5, 3, 2, 3), 
  Item3 = c(5, 6, 5, 6, 4, 3, 4), 
  Item4 = c(7, 7, 6, 7, 6, 7, 5)), 
  .Names = c("Item1", "Item2", "Item3", "Item4"), 
  row.names = c("Set1", "Set2", "Set3", "Set4", "Set5", "Set6", "Set7"), 
  class = "data.frame")

Define two helper functions

balanceMatrix counts how many times each attribute appears in each column of the design:

balanceMatrix <- function(x){
    t(sapply(unique(unlist(x)), function(i)colSums(x==i)))
}

balanceScore calculates a metric of 'fit' - lower scores are better, and zero is perfect:

balanceScore <- function(x){
    sum((1-x)^2)
}
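
To see how these two helpers work together, here is a quick check on the starting design (a minimal sketch using the df defined above). For this design, a fully balanced solution would show every attribute exactly once in every column, so its balance matrix would contain only ones and its score would be 0.

# Tabulate how often each attribute appears in each column of the
# starting design, then summarise the deviation from perfect balance
balanceMatrix(df)
balanceScore(balanceMatrix(df))   # 0 would indicate a perfectly balanced design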

Define a function that randomly reorders the items within each row

findBalance <- function(x, nrepeat=100){
    df <- x
    minw <- Inf
    for (n in 1:nrepeat){
        # Randomly permute the items within each row
        for (i in 1:nrow(x)){df[i, ] <- sample(df[i, ])}
        # Score the resulting design and keep the best one found so far
        w <- balanceMatrix(df)
        sumw <- balanceScore(w)
        if(sumw < minw){
            dfbest <- df
            minw <- sumw
        }
    }
    dfbest
}

Main code

The dataframe df is a balanced design of 7 sets. Each set will display 4 items to the respondent. The numeric values in df refer to 7 different attributes. For example, in Set1 the respondent will be asked to choose his/her preferred option from attributes 1, 3, 4 and 7.

The ordering of items in each set is conceptually not important. Thus an ordering of (1,4,5,7) is identical to (7,5,4,1).

However, in a fully balanced design each attribute would also appear an equal number of times in each column. This design is therefore imbalanced, since attribute 1 appears 4 times in column 1 (see the quick check after the printout below):

df

     Item1 Item2 Item3 Item4
Set1     1     4     5     7
Set2     2     4     6     7
Set3     1     2     5     6
Set4     3     5     6     7
Set5     1     3     4     6
Set6     1     2     3     7
Set7     2     3     4     5
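
A quick check of the imbalance noted above, counting how often attribute 1 occupies the first position:

sum(df$Item1 == 1)
[1] 4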

To try to find a more balanced design, I wrote the function findBalance. This conducts a random search for better solutions by randomly reordering the items within each row of df. With 100 repeats it finds the following best solution:

set.seed(12345)
dfbest <- findBalance(df, nrepeat=100)
dfbest

     Item1 Item2 Item3 Item4
Set1     7     5     1     4
Set2     6     7     4     2
Set3     2     1     5     6
Set4     5     6     7     3
Set5     3     1     6     4
Set6     7     2     3     1
Set7     4     3     2     5

This appears more balanced, and the calculated balance matrix contains mostly ones. The balance matrix counts the number of times each attribute appears in each column. For example, the top left-hand cell of the following table indicates that attribute 1 does not appear at all in column 1, while the cell to its right shows that it appears twice in column 2:

balanceMatrix(dfbest)

     Item1 Item2 Item3 Item4
[1,]     0     2     1     1
[2,]     1     1     1     1
[3,]     1     1     1     1
[4,]     1     0     1     2
[5,]     1     1     1     1
[6,]     1     1     1     1
[7,]     2     1     1     0

The balance score for this solution is 6, corresponding to the six cells of the balance matrix that deviate from 1:

balanceScore(balanceMatrix(dfbest))
[1] 6
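
Equivalently, given the balance matrix printed above, the deviating cells can be counted directly:

sum(balanceMatrix(dfbest) != 1)
[1] 6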

My question

Thank you for following this detailed example. My question: how can I rewrite this search function to be more systematic? I'd like to tell R to:

  • Minimize balanceScore(df)
  • By changing the order of items within each row of df
  • Subject to: each row remaining a permutation of its original values

Solution

  • OK, I somehow misunderstood your question. So bye bye Fedorov, hello applied Fedorov.

    The following algorithm is based on the second iteration of the Fedorov algorithm:

    1. calculate all possible permutations for every set, and store them in the C0 list
    2. draw a first possible solution from the C0 space (one permutation for every set). This could be the original design, but as I need the indices, I'd rather just start at random.
    3. calculate the score of every candidate solution obtained by replacing the first set with each of its permutations in turn.
    4. replace the first set with the permutation giving the lowest score
    5. repeat 3-4 for every other set
    6. repeat 3-5 until the score reaches 0 or for n iterations.

    Optionally, you can restart the procedure after 10 iterations and start from another starting point. In your test case, it turned out that a few starting points converged very slowly to 0. The function below found balanced experimental designs with a score of 0 in, on average, 1.5 seconds on my computer:

    > X <- findOptimalDesign(df)
    > balanceScore(balanceMatrix(X))
    [1] 0
    > mean(replicate(20, system.time(X <- findOptimalDesign(df))[3]))
    [1] 1.733
    

    So this is the function now (given your original balanceMatrix and balanceScore functions):

    findOptimalDesign <- function(x, iter=4, restart=TRUE){
        stopifnot(require(combinat))
        # transform rows to a list of sets
        sets <- unlist(apply(x, 1, list), recursive=FALSE)
        nsets <- NROW(x)
        # C0 contains all possible design points (every permutation of every set)
        C0 <- lapply(sets, permn)
        # number of permutations of a single set, i.e. (number of items)!
        n <- gamma(NCOL(x)+1)

        # random starting point: one permutation per set
        id <- sample(1:n, nsets)
        Sol <- sapply(1:nsets, function(i) C0[[i]][id[i]])

        IT <- iter
        while(IT > 0){
            for(i in 1:nsets){
                nn <- 1:n
                # score every design obtained by swapping in each permutation
                # of set i while keeping the other sets fixed
                scores <- sapply(nn, function(p){
                    tmp <- Sol
                    tmp[[i]] <- C0[[i]][[p]]
                    w <- balanceMatrix(do.call(rbind, tmp))
                    balanceScore(w)
                })
                # keep the permutation of set i that gives the lowest score
                idnew <- nn[which.min(scores)]
                Sol[[i]] <- C0[[i]][[idnew]]
            }
            # check whether the design is perfectly balanced (score 0)
            out <- as.data.frame(do.call(rbind, Sol))
            score <- balanceScore(balanceMatrix(out))
            if (score == 0) {break}
            IT <- IT - 1

            # if asked, restart from a new random starting point
            if(IT == 0 & restart){
                id <- sample(1:n, nsets)
                Sol <- sapply(1:nsets, function(i) C0[[i]][id[i]])
                IT <- iter
            }
        }
        out
    }
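
    As a quick sanity check (a small sketch, assuming X holds the result of findOptimalDesign(df) as in the timing example above), every row of the optimised design should still be a permutation of the corresponding row of the original design:

    # each row of X should contain exactly the same attributes as the
    # corresponding row of df, only in a different order; should return TRUE
    all(sapply(1:nrow(df), function(i) setequal(unlist(X[i, ]), unlist(df[i, ]))))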
    

    HTH

    EDIT: fixed a small bug (it restarted immediately after every round because I forgot to condition on IT). With that fixed, it also runs a bit faster.