Search code examples
statisticsdistributionjuliainequality

Gini Coefficient in Julia: Efficient and Accurate Code


I'm trying to implement the following formula in Julia for calculating the Gini coefficient of a wage distribution:

enter image description here

where enter image description here

Here's a simplified version of the code I'm using for this:

# Takes a array where first column is value of wages
# (y_i in formula), and second column is probability
# of wage value (f(y_i) in formula).
function gini(wagedistarray)
    # First calculate S values in formula
    for i in 1:length(wagedistarray[:,1])
        for j in 1:i
            Swages[i]+=wagedistarray[j,2]*wagedistarray[j,1]
        end
    end

    # Now calculate value to subtract from 1 in gini formula
    Gwages = Swages[1]*wagedistarray[1,2]
    for i in 2:length(Swages)
        Gwages += wagedistarray[i,2]*(Swages[i]+Swages[i-1])
    end

    # Final step of gini calculation
    return giniwages=1-(Gwages/Swages[length(Swages)])          
end

wagedistarray=zeros(10000,2)                                 
Swages=zeros(length(wagedistarray[:,1]))                    

for i in 1:length(wagedistarray[:,1])
   wagedistarray[i,1]=1
   wagedistarray[i,2]=1/10000
end


@time result=gini(wagedistarray)

It gives a value of near zero, which is what you expect for a completely equal wage distribution. However, it takes quite a long time: 6.796 secs.

Any ideas for improvement?


Solution

  • Try this:

    function gini(wagedistarray)
        nrows = size(wagedistarray,1)
        Swages = zeros(nrows)
        for i in 1:nrows
            for j in 1:i
                Swages[i] += wagedistarray[j,2]*wagedistarray[j,1]
            end
        end
    
        Gwages=Swages[1]*wagedistarray[1,2]
        for i in 2:nrows
            Gwages+=wagedistarray[i,2]*(Swages[i]+Swages[i-1])
        end
    
        return 1-(Gwages/Swages[length(Swages)])
    
    end
    
    wagedistarray=zeros(10000,2)
    for i in 1:size(wagedistarray,1)
       wagedistarray[i,1]=1
       wagedistarray[i,2]=1/10000
    end
    
    @time result=gini(wagedistarray)
    
    • Time before: 5.913907256 seconds (4000481676 bytes allocated, 25.37% gc time)
    • Time after: 0.134799301 seconds (507260 bytes allocated)
    • Time after (second run): elapsed time: 0.123665107 seconds (80112 bytes allocated)

    The primary problems are that Swages was a global variable (wasn't living in the function) which is not a good coding practice, but more importantly is a performance killer. The other thing I noticed was length(wagedistarray[:,1]), which makes a copy of that column and then asks its length - that was generating some extra "garbage". The second run is faster because there is some compilation time the very first time the function is run.

    You crank performance even higher using @inbounds, i.e.

    function gini(wagedistarray)
        nrows = size(wagedistarray,1)
        Swages = zeros(nrows)
        @inbounds for i in 1:nrows
            for j in 1:i
                Swages[i] += wagedistarray[j,2]*wagedistarray[j,1]
            end
        end
    
        Gwages=Swages[1]*wagedistarray[1,2]
        @inbounds for i in 2:nrows
            Gwages+=wagedistarray[i,2]*(Swages[i]+Swages[i-1])
        end
    
        return 1-(Gwages/Swages[length(Swages)])
    end
    

    which gives me elapsed time: 0.042070662 seconds (80112 bytes allocated)

    Finally, check out this version, which is actually faster than all and is also the most accurate I think:

    function gini2(wagedistarray)
        Swages = cumsum(wagedistarray[:,1].*wagedistarray[:,2])
        Gwages = Swages[1]*wagedistarray[1,2] +
                    sum(wagedistarray[2:end,2] .* 
                            (Swages[2:end]+Swages[1:end-1]))
        return 1 - Gwages/Swages[end]
    end
    

    Which has elapsed time: 0.00041119 seconds (721664 bytes allocated). The main benefit was changing from a O(n^2) double for loop to the O(n) cumsum.