Search code examples
pythonnumpyscipystatisticsgamma-distribution

Fit the gamma distribution only to a subset of the samples


I have the histogram of my input data (in black) given in the following graph:

histogram

I'm trying to fit the Gamma distribution but not on the whole data but just to the first curve of the histogram (the first mode). The green plot in the previous graph corresponds to when I fitted the Gamma distribution on all the samples using the following python code which makes use of scipy.stats.gamma:

img = IO.read(input_file)
data = img.flatten() + abs(np.min(img)) + 1

# calculate dB positive image
img_db = 10 * np.log10(img)
img_db_pos = img_db + abs(np.min(img_db))
data = img_db_pos.flatten() + 1

# data histogram
n, bins, patches = plt.hist(data, 1000, normed=True)

# slice histogram here

# estimation of the parameters of the gamma distribution
fit_alpha, fit_loc, fit_beta = gamma.fit(data, floc=0)
x = np.linspace(0, 100)
y = gamma.pdf(x, fit_alpha, fit_loc, fit_beta)
print '(alpha, beta): (%f, %f)' % (fit_alpha, fit_beta)

# plot estimated model
plt.plot(x, y, linewidth=2, color='g')
plt.show()

How can I restrict the fitting only to the interesting subset of this data?

Update1 (slicing):

I sliced the input data by keeping only values below the max of the previous histogram, but the results were not really convincing:

histogram2

This was achieved by inserting the following code below the # slice histogram here comment in the previous code:

max_data = bins[np.argmax(n)]
data = data[data < max_data]

Update2 (scipy.optimize.minimize):

The code below shows how scipy.optimize.minimize() is used to minimize an energy function to find (alpha, beta):

import matplotlib.pyplot as plt
import numpy as np
from geotiff.io import IO
from scipy.stats import gamma
from scipy.optimize import minimize


def truncated_gamma(x, max_data, alpha, beta):
    gammapdf = gamma.pdf(x, alpha, loc=0, scale=beta)
    norm = gamma.cdf(max_data, alpha, loc=0, scale=beta)
    return np.where(x < max_data, gammapdf / norm, 0)


# read image
img = IO.read(input_file)

# calculate dB positive image
img_db = 10 * np.log10(img)
img_db_pos = img_db + abs(np.min(img_db))
data = img_db_pos.flatten() + 1

# data histogram
n, bins = np.histogram(data, 100, normed=True)

# using minimize on a slice data below max of histogram
max_data = bins[np.argmax(n)]
data = data[data < max_data]

data = np.random.choice(data, 1000)
energy = lambda p: -np.sum(np.log(truncated_gamma(data, max_data, *p)))
initial_guess = [np.mean(data), 2.]
o = minimize(energy, initial_guess, method='SLSQP')
fit_alpha, fit_beta = o.x

# plot data histogram and model
x = np.linspace(0, 100)
y = gamma.pdf(x, fit_alpha, 0, fit_beta)
plt.hist(data, 30, normed=True)
plt.plot(x, y, linewidth=2, color='g')
plt.show()

The algorithm above converged for a subset of data, and the output in o was:

x: array([ 16.66912781,   6.88105559])

But as can be seen on the screenshot below, the gamma plot doesn't fit the histogram:

minimize


Solution

  • Here's another possible approach using a manually created dataset in excel that more or less matched the plot given.

    Raw Data

    enter image description here enter image description here

    Outline

    • Imported data into a Pandas dataframe.
    • Mask the indices after the max response index.
    • Create a mirror image of the remaining data.
    • Append the mirror image while leaving a buffer of empty space.
    • Fit the desired distribution to the modified data. Below I do a normal fit by the method of moments and adjust the amplitude and width.

    Working Script

        # Import data to dataframe.
        df = pd.read_csv('sample.csv', header=0, index_col=0)
        # Mask indices after index at max Y.
        mask = df.index.values <= df.Y.argmax()
        df = df.loc[mask, :]
        scaled_y = 100*df.Y.values
    
        # Create new df with mirror image of Y appended.
        sep = 6
        app_zeroes = np.append(scaled_y, np.zeros(sep, dtype=np.float))
        mir_y = np.flipud(scaled_y)
        new_y = np.append(app_zeroes, mir_y)
    
        # Using Scipy-cookbook to fit a normal by method of moments.
        idxs = np.arange(new_y.size)  # idxs=[0, 1, 2,...,len(data)]
        mid_idxs = idxs.mean() # len(data)/2
        # idxs-mid_idxs is [-53.5, -52.5, ..., 52.5, len(data)/2]
        scaling_param = np.sqrt(np.abs(np.sum((idxs-mid_idxs)**2*new_y)/np.sum(new_y)))
    
        # adjust amplitude
        fmax = new_y.max()*1.2 # adjusted function max to 120% max y.
        # adjust width
        scaling_param = scaling_param*.7 # adjusted by 70%.
        # Fit normal.
        fit = lambda t: fmax*np.exp(-(t-mid_idxs)**2/(2*scaling_param**2))
    
        # Plot results.
        plt.plot(new_y, '.')
        plt.plot(fit(idxs), '--')
        plt.show()
    

    Result ![enter image description here

    See the scipy-cookbook fitting data page for more on fitting a normal using method of moments.