Search code examples
pythonrmathstatisticspower-law

How to properly fit data to a power law in Python?


I am considering the number of occurrence of unique words in the Moby Dick novel and using the powerlaw python package to fit words’ frequencies to a power law.

I am not sure why I can't recapitulate the results from previous work by Clauset et al. as both the p-value and the KS score are "bad".

The idea is to fit the frequencies of unique words into a power law. However, the Kolmogorov-Smirnov tests for goodness of fit calculated by scipy.stats.kstest look terrible.

I have the following function to fit data to a power law:

import numpy as np
import powerlaw
import scipy
from scipy import stats

def fit_x(x):
    fit = powerlaw.Fit(x, discrete=True)
    alpha = fit.power_law.alpha
    xmin  = fit.power_law.xmin
    print('powerlaw', scipy.stats.kstest(x, "powerlaw", args=(alpha, xmin), N=len(x)))
    print('lognorm', scipy.stats.kstest(x, "lognorm", args=(np.mean(x), np.std(x)), N=len(x)))

Downloading the frequency of unique words in the novel Moby Dick by Herman Melville (supposed to follow a power law according to Aaron Clauset et al.):

wget http://tuvalu.santafe.edu/~aaronc/powerlaws/data/words.txt

Python script:

x =  np.loadtxt('./words.txt')
fit_x(x)

results:

('powerlaw', KstestResult(statistic=0.862264651286131, pvalue=0.0))
('log norm', KstestResult(statistic=0.9910368602492707, pvalue=0.0))

When I compare the expected results and follow this R tutorial on the same Moby Dick dataset I get a decent p-value and KS test value:

library("poweRlaw")
data("moby", package="poweRlaw")
m_pl = displ$new(moby)
est = estimate_xmin(m_pl)
m_pl$setXmin(est)
bs_p = bootstrap_p(m_pl)
bs_p$p
## [1] 0.6738

What am I missing when computing the KS test values and postprocessing the fit by the powerlaw python library? The PDF and CDF look ok to me, but the KS tests look awry.

enter image description here


Solution

  • It is still not clear to me how to determine significance and goodness of fit by using the scipy.stats.kstest with the powerlaw library.

    Though, powerlaw implements its own distribution_compare capability which returns the likelihood ratio R and the p-val of R (see some content from Aaron Clauset on here):

    R : float Loglikelihood ratio of the two distributions' fit to the data. If greater than 0, the first distribution is preferred. If less than 0, the second distribution is preferred.

    p : float Significance of R

    from numpy import genfromtxt
    import urllib
    import powerlaw
    
    urllib.urlretrieve('https://raw.github.com/jeffalstott/powerlaw/master/manuscript/words.txt', 'words.txt')
    words = genfromtxt('words.txt')
    
    fit = powerlaw.Fit(words, discrete=True)
    
    print(fit.distribution_compare('power_law', 'exponential', normalized_ratio=True))
    (9.135914718776998, 6.485614241379581e-20)
    print(fit.distribution_compare('power_law', 'truncated_power_law'))
    (-0.917123083373983, 0.1756268316869548)
    print(fit.distribution_compare('power_law', 'truncated_power_law'))
    (-0.917123083373983, 0.1756268316869548)
    print(fit.distribution_compare('power_law', 'lognormal'))
    (0.008785246720842022, 0.9492243713193919)