Search code examples
algorithmlanguage-agnosticsequencesrandom

Algorithm to generate a sequence proportional to specified percentage


Given a Map of objects and designated proportions (let's say they add up to 100 to make it easy):

val ss : Map[String,Double] = Map("A"->42, "B"->32, "C"->26)

How can I generate a sequence such that for a subset of size n there are ~42% "A"s, ~32% "B"s and ~26% "C"s? (Obviously, small n will have larger errors).

(Work language is Scala, but I'm just asking for the algorithm.)

UPDATE: I resisted a random approach since, for instance, there's ~16% chance that the sequence would start with AA and ~11% chance it would start with BB and there would be very low odds that for n precisely == (sum of proportions) the distribution would be perfect. So, following @MvG's answer, I implemented as follows:

/**
Returns the key whose achieved proportions are most below desired proportions
*/
def next[T](proportions : Map[T, Double], achievedToDate : Map[T,Double]) : T = {
    val proportionsSum = proportions.values.sum
    val desiredPercentages = proportions.mapValues(v => v / proportionsSum)
    //Initially no achieved percentages, so avoid / 0 
    val toDateTotal = if(achievedToDate.values.sum == 0.0){
        1
    }else{
        achievedToDate.values.sum
    }
    val achievedPercentages = achievedToDate.mapValues(v => v / toDateTotal)
    val gaps = achievedPercentages.map{ case (k, v) =>
        val gap = desiredPercentages(k) - v
        (k -> gap)
    }
    val maxUnder = gaps.values.toList.sortWith(_ > _).head
    //println("Max gap is " + maxUnder)
    val gapsForMaxUnder = gaps.mapValues{v => Math.abs(v - maxUnder) < Double.Epsilon }
    val keysByHasMaxUnder = gapsForMaxUnder.map(_.swap)
    keysByHasMaxUnder(true)
}

/**
Stream of most-fair next element 
*/
def proportionalStream[T](proportions : Map[T, Double], toDate : Map[T, Double]) : Stream[T] = {
    val nextS = next(proportions, toDate)
    val tailToDate = toDate + (nextS -> (toDate(nextS) + 1.0))
    Stream.cons(
        nextS,
        proportionalStream(proportions, tailToDate)
    )
}

That when used, e.g., :

val ss : Map[String,Double] = Map("A"->42, "B"->32, "C"->26)
val none : Map[String,Double] = ss.mapValues(_ => 0.0)
val mySequence = (proportionalStream(ss, none) take 100).toList
println("Desired : " + ss)
println("Achieved : " + mySequence.groupBy(identity).mapValues(_.size))
mySequence.map(s => print(s))
println

produces :

Desired : Map(A -> 42.0, B -> 32.0, C -> 26.0)
Achieved : Map(C -> 26, A -> 42, B -> 32)
ABCABCABACBACABACBABACABCABACBACABABCABACABCABACBA
CABABCABACBACABACBABACABCABACBACABABCABACABCABACBA

Solution

  • For a deterministic approach, the most obvious solution would probably be this:

    • Keep track of the number of occurrences of each item in the sequence so far.
    • For the next item, choose that item for which the difference between intended and actual count (or proportion, if you prefer that) is maximal, but only if the intended count (resp. proportion) is greater than the actual one.
    • If there is a tie, break it in an arbitrary but deterministic way, e.g. choosing the alphabetically lowest item.

    This approach would ensure an optimal adherence to the prescribed ratio for every prefix of the infinite sequence generated in this way.

    Quick & dirty python proof of concept (don't expect any of the variable “names” to make any sense):

    import sys
    
    p = [0.42, 0.32, 0.26]
    c = [0, 0, 0]
    a = ['A', 'B', 'C']
    n = 0
    
    while n < 70*5:
        n += 1
        x = 0
        s = n*p[0] - c[0]
        for i in [1, 2]:
            si = n*p[i] - c[i]
            if si > s:
                x = i
                s = si
        sys.stdout.write(a[x])
        if n % 70 == 0:
            sys.stdout.write('\n')
        c[x] += 1
    

    Generates

    ABCABCABACABACBABCAABCABACBACABACBABCABACABACBACBAABCABCABACABACBABCAB
    ACABACBACABACBABCABACABACBACBAABCABCABACABACBABCAABCABACBACABACBABCABA
    CABACBACBAABCABCABACABACBABCABACABACBACBAACBABCABACABACBACBAABCABCABAC
    ABACBABCABACABACBACBAACBABCABACABACBACBAABCABCABACABACBABCABACABACBACB
    AACBABCABACABACBACBAABCABCABACABACBABCAABCABACBACBAACBABCABACABACBACBA