I have done some searching but I cannot seem to be able to find a reasonable way to sample from a truncated normal distribution.
Without truncation I was doing:
samples = [np.random.normal(loc=x,scale=d) for (x,d) in zip(X,D)]
X
and D
being lists of floats.
Currently I am implementing truncation as such:
def truncnorm(loc,scale,bounds):
s = np.random.normal(loc,scale)
if s > bounds[1]:
return bounds[1]
elif s < bounds[0]:
return bounds[0]
return s
samples = [truncnorm(loc=x,scale=d,bounds=b) for (x,d,b) in zip(X,D,bounds)]
bounds
being a list of tuples (min,max)
This approach feels a little awkward, so I'm wondering if there is a better way?
Returning the value of the bounds for samples outside them, will result in too many samples falling on the bounds. This is not representative of the actual distribution. The values on the bounds need to be rejected and replaced by a new sample. Such code could be:
def test_truncnorm(loc, scale, bounds):
while True:
s = np.random.normal(loc, scale)
if bounds[0] <= s <= bounds[1]:
break
return s
This can be extremely slow given narrow bounds. Scipy's truncnorm handles such cases more efficiently. A bit surprisingly, the bounds are expressed in function of the standard normal, so your call would be:
s = scipy.stats.truncnorm.rvs((bounds[0]-loc)/scale, (bounds[1]-loc)/scale, loc=loc, scale=scale)
Note that scipy works much faster when making use of numpy's vectorization and broadcasting. And once you're used to the notation, it also looks simpler to write and read. All samples can be calculated in one go as:
X = np.array(X)
D = np.array(D)
bounds = np.array(bounds)
samples = scipy.stats.truncnorm.rvs((bounds[:, 0] - X) / D, (bounds[:, 1] - X) / D, loc=X, scale=D)