Search code examples
rdesign-patternspower-law

KS test for power law


Im attempting fitting a powerlaw distribution to a data set, using the method outlined by Aaron Clauset, Cosma Rohilla Shalizi and M.E.J. Newman in their paper "Power-Law Distributions in Empirical Data".

I've found code to compare to my own, but im a bit mystified where some of it comes from, the story thus far is,

to identify a suitable xmin for the powerlaw fit, we take each possible xmin fit a powerlaw to that data and then compute the corresponding exponet (a) then the KS statistic (D) for the fit and the observed data, then find the xmin that corresponds to the minimum of D. The KS statistic if computed as follows,

cx   <- c(0:(n-1))/n # n is the sample size for the data >= xmin
cf   <- 1-(xmin/z)^a # the cdf for a powerlaw z = x[x>=xmin]
D <- max(abs(cf-cx))

what i dont get is where cx comes for, surely we should be comparing the distance between the empirical distributions and the calculated distribution. something along the lines of:

cx = ecdf(sort(z))
cf   <- 1-(xmin/z)^a
D <- max(abs(cf-cx(z)))

I think im just missing something very basic but please do correct me!


Solution

  • The answer is that they are (almost) the same. The easiest way to see this is to generate some data:

    z = sort(runif(5,xmin, 10*xmin))
    n = length(x)
    

    Then examine the values of the two CDFs

    R> (cx1 = c(0:(n-1))/n)
    [1] 0.0 0.2 0.4 0.6 0.8
    R> (cx2 = ecdf(sort(z)))
    [1] 0.2 0.4 0.6 0.8 1.0
    

    Notice that they are almost the same - essentially the cx1 gives the CDF for greater than or equal to whilst cx2 is greater than.

    The advantage of the top approach is that it is very efficient and quick to calculate. The disadvantage is that if your data isn't truly continuous, i.e. z=c(1,1,2), cx1 is wrong. But then you shouldn't be fitting your data to a CTN distribution if this were the case.