Search code examples
pythonnumpyscipymontecarlobeta-distribution

Efficiently simulating many frequency-severity distributions over thousands of iterations in numpy


I've got a problem at work that goes as follows:

We have, say, 1 million possible events which define frequency-severity distributions. For each event we have an annual rate which defines a Poisson distribution, and alpha and beta parameters for a Beta distributions. The goal is to simulate in the order of >100,000 "years", each year being defined as getting a frequency N for each event, and getting N samples of the relative beta distribution.

The cumbersome fact for me is how can I efficiently get N_i ~ Poisson(lambda_i) samples from Beta distribution Beta_i while also making sure I can attribute them to the correct year?

In terms of output I'll need to look at both the maximum and total value of the samples per year, so temporarily I'm just storing it as an array of dictionaries (not intended to be the output format)

years = 5000

rng = np.random.default_rng()
losses = []
for year in range(years):
  occurences = rng.poisson(data['RATE'])
  annual_losses = []
  for idx, occs in enumerate(occurences):
    if occs > 0:
      event = data.iloc[idx]
      for occ in range(occs):
        loss = rng.beta(event['alpha'], event['beta']) * event['ExpValue']
        annual_losses.append(loss)
  annual_losses.append(0)
  losses.append({'year': year, 'losses': annual_losses})

I've tried to follow Optimising Python/Numpy code used for Simulation however I can't seem to understand how to vectorise this code efficiently.

Changes I've made before posting here are (times for 5000 years):

  • swapping from scipy to numpy (72s -> 66s)
  • calculating frequencies for all years in once outside the loop (66s -> 73s... oops)

Ideally I'd like this to run as fast, or for as many iterations, as possible, and I was also running into issues with memory using scipy previously.

EDIT:

Upon request here's a version that just calculates the maximum loss for each year

years = 20_000

rng = np.random.default_rng()
largest_losses = np.zeros(shape=years)
for year in range(years):
  occurences = rng.poisson(data['RATE'])
  largest_loss = 0
  for idx, event_occs in enumerate(occurences):
    if event_occs > 0:
      event = data.iloc[idx]
      for event_occ in range(event_occs):
        loss = rng.beta(event['alpha'], event['beta']) * event['ExpValue']
        if loss > largest_loss:
          largest_loss = loss
  largest_losses[year] = largest_loss

For testing purposes the events have a total rate of ~0.997 and the timed tests above are from an event pool of 100,604 events.

To specify the goal, I'd like to see if I can calculate the percentiles of the losses, e.g. the 1-in-250 loss

np.percentile(largest_losses, (1 - 1/250) * 100)

which currently aren't accurate at 5,000, hence wanting a process that can run in a few minutes or less for ~100,000 years.

EDIT 2: Reproduceable values (as np arrays):

rate_scale = 4.81e-4
rate_shape = 2.06e-2
rates = rng.gamma(rate_shape, rate_scale, size=nevents)

alpha_scale = 1.25
alpha_shape = 0.439
alphas = rng.gamma(alpha_shape, alpha_scale, size=nevents)

beta_scale = 289
beta_shape = 0.346
betas = rng.gamma(beta_shape, beta_scale, size=nevents)

value_scale = 4.2e8
value_shape = 1.0
values = rng.gamma(value_shape, value_scale, size=nevents)

and have amended my current code to use numpy arrays

years = 20_000

rng = np.random.default_rng()
largest_losses = np.zeros(shape=years)
for year in range(years):
  occurences = rng.poisson(rates)
  largest_loss = 0
  for idx, event_occs in enumerate(occurences):
    for event_occ in range(event_occs):
      loss = rng.beta(alphas[idx], betas[idx]) * values[idx]
      if loss > largest_loss:
        largest_loss = loss
  largest_losses[year] = largest_loss

f'{np.percentile(largest_losses, (1 - 1/250) * 100):,.0f}'

Solution

  • I profiled @WarrenWeckesser's answer, and I found that it spends 93% of its time generating poisson-distributed numbers using the rate array.

    Here's a line profile of what I'm talking about:

    Timer unit: 1e-09 s
    
    Total time: 3.57874 s
    File: /tmp/ipykernel_2774/1125349843.py
    Function: simulate_warren_answer at line 1
    
    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
         1                                           def simulate_warren_answer(nyears, rate, alpha, beta, value, rng):
         2         1      13092.0  13092.0      0.0      max_loss = np.zeros(nyears, dtype=np.float64)
         3         1       3150.0   3150.0      0.0      total_loss = np.zeros(nyears, dtype=np.float64)
         4      1001     347152.0    346.8      0.0      for year in range(nyears):
         5      1000 3355879564.0    3e+06     93.8          n = rng.poisson(rate)
         6      1000  179118381.0 179118.4      5.0          k = np.nonzero(n)[0]
         7      1000     911231.0    911.2      0.0          if len(k) > 0:
         8       654     935876.0   1431.0      0.0              nonzero_n = n[k]
         9       654    4049698.0   6192.2      0.1              a = np.repeat(alpha[k], nonzero_n)
        10       654    2645836.0   4045.6      0.1              b = np.repeat(beta[k], nonzero_n)
        11       654    2550944.0   3900.5      0.1              v = np.repeat(value[k], nonzero_n)
        12       654   26183513.0  40036.0      0.7              losses = rng.beta(a, b)*v
        13       654    3557136.0   5439.0      0.1              max_loss[year] = losses.max()
        14       654    2543208.0   3888.7      0.1              total_loss[year] = losses.sum()
        15         1        303.0    303.0      0.0      return max_loss, total_loss
    

    For this reason, I started looking into whether it was possible to speed up poisson-distributed random number generation. I started by looking at the lengths of n and k, and realized that for the rates that the asker is interested in, k is much much smaller than n, and most entries in n are zero.

    If there is a 98% chance that the poisson-distributed number will be zero, I can simulate this by generating a uniformly distributed random number between [0, 1), and comparing it to 0.02. This leaves me to handle the other case 2% of the time, where the number will be 1+.

    I then realized that I could generalize this to generating any value of the poisson function, by comparing the random uniform value to the CDF. (In the code below, I am using the survival function, which is 1-CDF. This is for accuracy reasons; numbers near 0 are represented more accurately than numbers near 1.)

    This code generates a lookup table for converting uniformly-distributed numbers into poisson-distributed numbers according to each rate. Because the rates are small, most of the time, the code will only need to examine the first entry for each rate, and the result will be zero. It also does the k = np.nonzero(n)[0] step inside this loop, as it is more cache efficient.

    Because the rates are not changing, the lookup table only needs to be build once, and can be re-used between years.

    import scipy
    import numpy as np
    import numba as nb
    
    
    def simulate_new_rng(nyears, rate, alpha, beta, value, rng):
        max_loss = np.empty(nyears, dtype=np.float64)
        total_loss = np.zeros(nyears, dtype=np.float64)
        lookup = build_lookup(rate)
        for year in range(nyears):
            uniform = rng.random(len(rate))
            n, nonzeros = uniform_to_poisson_plus_nonzero(uniform, lookup)
            n_nonzeros = n[nonzeros]
            a = np.repeat(alpha[nonzeros], n_nonzeros)
            b = np.repeat(beta[nonzeros], n_nonzeros)
            v = np.repeat(value[nonzeros], n_nonzeros)
            losses = rng.beta(a, b)*v
            max_loss[year] = losses.max(initial=0)
            total_loss[year] = losses.sum()
        return max_loss, total_loss
    
    
    def build_lookup(rate, ptol=1e-15):
        """Build uniform -> poisson lookup table based on survival function
        
        Does not include an entry for k if the probability of the most likely
        rate to be k is less than ptol."""
        ret = []
        for k in itertools.count():
            sf = scipy.stats.poisson.sf(k, rate)
            # include first row of lookup table no matter what
            if sf.max() < ptol and k != 0:
                break
            ret.append(sf)
        return np.array(ret)
    
    
    @nb.njit()
    def uniform_to_poisson_plus_nonzero(numbers, lookup):
        assert numbers.shape[0] == lookup.shape[1]
        nonzeros = []
        ret = np.empty(lookup.shape[1], dtype=np.int64)
        for j in range(lookup.shape[1]):
            number = numbers[j]
            for i in range(lookup.shape[0]):
                if number < lookup[i, j]:
                    continue
                else:
                    break
            else:
                raise Exception("failed to find match for survival function - ptol is too big")
            ret[j] = i
            if i != 0:
                nonzeros.append(j)
        return ret, np.array(nonzeros, dtype=np.int64)
    

    Benchmark result:

    Original code: 24.4 s ± 308 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    Warrens answer: 3.44 s ± 18.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    My answer: 648 ms ± 17.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

    Appendix: Full benchmarking code

    import matplotlib.pyplot as plt
    import itertools
    import scipy
    import numba as nb
    import numpy as np
    import pandas as pd
    
    
    rng = np.random.default_rng()
    nevents = 100604
    nyears = 1000
    
    rate_scale = 4.81e-4
    rate_shape = 2.06e-2
    rates = rng.gamma(rate_shape, rate_scale, size=nevents)
    
    alpha_scale = 1.25
    alpha_shape = 0.439
    alphas = rng.gamma(alpha_shape, alpha_scale, size=nevents)
    
    beta_scale = 289
    beta_shape = 0.346
    betas = rng.gamma(beta_shape, beta_scale, size=nevents)
    
    value_scale = 4.2e8
    value_shape = 1.0
    values = rng.gamma(value_shape, value_scale, size=nevents)
    
    
    def simulate_orig_answer(nyears, rate, alpha, beta, value, rng):
        data = pd.DataFrame({'RATE': rate, 'alpha': alpha, 'beta': beta, 'ExpValue': value})
        rng = np.random.default_rng()
        largest_losses = np.zeros(shape=nyears)
        total_loss = np.zeros(nyears, dtype=np.float64)
        for year in range(nyears):
            occurences = rng.poisson(data['RATE'])
            largest_loss = 0
            sum_loss = 0
            for idx, event_occs in enumerate(occurences):
                if event_occs > 0:
                    event = data.iloc[idx]
                    for event_occ in range(event_occs):
                        loss = rng.beta(event['alpha'], event['beta']) * event['ExpValue']
                        if loss > largest_loss:
                            largest_loss = loss
                        sum_loss += loss
            largest_losses[year] = largest_loss
            total_loss[year] = sum_loss
        return largest_losses, total_loss
    
    
    def simulate_warren_answer(nyears, rate, alpha, beta, value, rng):
        max_loss = np.zeros(nyears, dtype=np.float64)
        total_loss = np.zeros(nyears, dtype=np.float64)
        for year in range(nyears):
            n = rng.poisson(rate)
            k = np.nonzero(n)[0]
            if len(k) > 0:
                nonzero_n = n[k]
                a = np.repeat(alpha[k], nonzero_n)
                b = np.repeat(beta[k], nonzero_n)
                v = np.repeat(value[k], nonzero_n)
                losses = rng.beta(a, b)*v
                max_loss[year] = losses.max()
                total_loss[year] = losses.sum()
        return max_loss, total_loss
    
    
    def simulate_new_rng(nyears, rate, alpha, beta, value, rng):
        max_loss = np.empty(nyears, dtype=np.float64)
        total_loss = np.zeros(nyears, dtype=np.float64)
        lookup = build_lookup(rate)
        for year in range(nyears):
            uniform = rng.random(len(rate))
            n, nonzeros = uniform_to_poisson_plus_nonzero(uniform, lookup)
            n_nonzeros = n[nonzeros]
            a = np.repeat(alpha[nonzeros], n_nonzeros)
            b = np.repeat(beta[nonzeros], n_nonzeros)
            v = np.repeat(value[nonzeros], n_nonzeros)
            losses = rng.beta(a, b)*v
            max_loss[year] = losses.max(initial=0)
            total_loss[year] = losses.sum()
        return max_loss, total_loss
    
    
    def build_lookup(rate, ptol=1e-15):
        """Build uniform -> poisson lookup table based on survival function
        
        Does not include an entry for k if the probability of the most likely
        rate to be k is less than ptol."""
        ret = []
        for k in itertools.count():
            sf = scipy.stats.poisson.sf(k, rate)
            # include first row of lookup table no matter what
            if sf.max() < ptol and k != 0:
                break
            ret.append(sf)
        return np.array(ret)
    
    
    @nb.njit()
    def uniform_to_poisson_plus_nonzero(numbers, lookup):
        assert numbers.shape[0] == lookup.shape[1]
        nonzeros = []
        ret = np.empty(lookup.shape[1], dtype=np.int64)
        for j in range(lookup.shape[1]):
            number = numbers[j]
            for i in range(lookup.shape[0]):
                if number < lookup[i, j]:
                    continue
                else:
                    break
            else:
                raise Exception("failed to find match for survival function - ptol is too big")
            ret[j] = i
            if i != 0:
                nonzeros.append(j)
        return ret, np.array(nonzeros, dtype=np.int64)
    
    
    %timeit simulate_orig_answer(nyears, rates, alphas, betas, values, rng)
    %timeit simulate_warren_answer(nyears, rates, alphas, betas, values, rng)
    %timeit simulate_new_rng(nyears, rates, alphas, betas, values, rng)