PHP Generate Array based on Value Frequency


I'm trying to understand the best way to build an array of values based on the frequency with which each value should occur. The resulting array may contain repeated characters, depending on each character's frequency, and the order of the elements isn't relevant. Here's a breakdown of example data:

Character Frequencies

a => 0.05
b => 0.05
c => 0.1
d => 0.1
e => 0.2
f => 0.5

Result Examples:

['b', 'd', 'a', 'f']
['f', 'f', 'c', 'a']
['e', 'c', 'a', 'f']
['a', 'e', 'f', 'd']

The math in these examples certainly isn't accurate; they only illustrate the statements above. I'm not concerned with the order of elements within each array, and some arrays may have repeating characters.

Here's a basic loop that builds the array. The contrived rand() call is a placeholder: it picks characters uniformly and ignores the frequencies. I've left out the various outrageous math approaches I've tried, in an effort to keep the question direct and conceptual.

$frequencies = [
    'a' => 0.05,
    'b' => 0.05,
    'c' => 0.1,
    'd' => 0.1,
    'e' => 0.2,
    'f' => 0.5
];

$characters = 'abcdef';
$charactersLength = strlen($characters);
$result = [];
for ($i = 0; $i < 4; $i++) {
    // $result[] = $this->getCharacterByFrequency();
    $result[] = $characters[rand(0, $charactersLength - 1)];
}

Solution

  • It would be cool to see whether anyone has a more efficient way of doing this; I'm sure one exists.

    $frequencies = [
        'a' => 0.05,
        'b' => 0.05,
        'c' => 0.1,
        'd' => 0.1,
        'e' => 0.2,
        'f' => 0.5
    ];
    
    $result = [];
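    for ($i = 0; $i < 4; ++$i) {
        // Draw a uniform random float in [0, 1].
        $r = mt_rand() / mt_getrandmax();
        // Subtract each frequency in turn; the letter whose slice drives $r below zero is the pick.
        foreach ($frequencies as $letter => $frequency) {
            $r -= $frequency;
            if ($r < 0) {
                break;
            }
        }
        // If $r never drops below zero (e.g. $r was exactly 1), $letter is left at the last key.
        $result[] = $letter;
    }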
    

    I tested this approach by generating 100000 characters, and the observed frequencies closely match the targets:

    array (size=6)
    'a' => float 0.0503105
    'b' => float 0.0496805
    'c' => float 0.099721
    'd' => float 0.100001
    'e' => float 0.201242
    'f' => float 0.499055
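
For anyone who wants to run a similar check, here is a minimal sketch of how such a tally could be produced. It wraps the subtraction loop in a helper named getCharacterByFrequency(), a name borrowed from the commented-out call in the question; the signature is an assumption, not something the original post defines. The fixed seed and the $samples, $counts and $observed variables are also illustrative choices. The sketch assumes the frequencies sum to 1.

```php
<?php
// Hypothetical helper; the name is borrowed from the commented-out call in the question.
// Assumes the frequencies sum to 1; any shortfall is absorbed by the last key.
function getCharacterByFrequency(array $frequencies): string
{
    $letter = '';
    $r = mt_rand() / mt_getrandmax(); // uniform float in [0, 1]
    foreach ($frequencies as $letter => $frequency) {
        $r -= $frequency;
        if ($r < 0) {
            break; // this letter's slice of [0, 1] contains the draw
        }
    }
    return (string) $letter;
}

$frequencies = [
    'a' => 0.05,
    'b' => 0.05,
    'c' => 0.1,
    'd' => 0.1,
    'e' => 0.2,
    'f' => 0.5
];

mt_srand(42); // fixed seed (an assumption) so repeated runs give the same tallies

// Count how often each character is drawn.
$samples = 100000;
$counts = array_fill_keys(array_keys($frequencies), 0);
for ($i = 0; $i < $samples; ++$i) {
    $counts[getCharacterByFrequency($frequencies)]++;
}

// Convert raw counts into observed frequencies.
$observed = [];
foreach ($counts as $letter => $count) {
    $observed[$letter] = $count / $samples;
}
var_dump($observed);
```

The exact numbers depend on the seed and PHP version, and the var_dump layout depends on whether an extension such as Xdebug is loaded, so the output will not match the listing above digit for digit. With this many samples, each observed value should land close to its target frequency.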