
Julia - describe() function displays incomplete summary statistics


I'm trying out basic data analysis with Julia.

I'm following this tutorial, using the training dataset that can be found here (the file named train_u6lujuX_CVtuZ9i.csv), with the following code:

using DataFrames, RDatasets, CSV, StatsBase
train = CSV.read("/Path/to/train_u6lujuX_CVtuZ9i.csv");
describe(train[:LoanAmount])

and get this output:

Summary Stats:
Length:         614
Type:           Union{Missing, Int64}
Number Unique:  204

instead of the output of the tutorial:

Summary Stats:
Mean:           146.412162
Minimum:        9.000000
1st Quartile:   100.000000
Median:         128.000000
3rd Quartile:   168.000000
Maximum:        700.000000
Length:         592
Type:           Int64
% Missing:      3.583062

The tutorial's output also matches what the describe() function should give according to StatsBase.jl.


Solution

  • This is how describe is implemented in the current release of StatsBase.jl. In short, the eltype of train.LoanAmount is Union{Missing, Int64}, which is not a subtype of Real, so StatsBase.jl uses a fallback method that prints only the length, eltype, and number of unique values. You can write describe(collect(skipmissing(train.LoanAmount))) to get the summary statistics (except the number of missing values, of course).

    However, I would recommend a different approach. If you want more verbose output for a single column, use:

    describe(train, :all, cols=:LoanAmount)
    

    The result is also returned as a DataFrame, so you can not only see the statistics but also access them.

    The :all option prints all statistics; please refer to the describe docstring in DataFrames.jl to see the available options.

    You can find some examples of using this function on a current release of DataFrames.jl here.
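
To make both approaches concrete, here is a minimal sketch on a small made-up vector standing in for train.LoanAmount (the values and the names loan, clean, and stats are hypothetical, not taken from the real CSV). It assumes a DataFrames.jl release where describe accepts statistic names as positional arguments and a cols keyword, as in the call above.

```julia
using DataFrames, StatsBase

# Hypothetical stand-in for train.LoanAmount: integers with two missing entries.
loan = Union{Missing, Int}[100, 128, missing, 168, 700, 9, missing]

# The element type allows missing, so it is not a subtype of Real.
# This is why the StatsBase fallback method was chosen in the question.
println(eltype(loan) <: Real)

# Approach 1: drop the missing values first, then describe the plain Int vector.
clean = collect(skipmissing(loan))
println(eltype(clean) <: Real)
describe(clean)

# The missing count is lost by skipmissing, so compute it separately if needed.
n_missing = count(ismissing, loan)

# Approach 2: let DataFrames.jl compute the statistics and return them as a DataFrame.
df = DataFrame(LoanAmount = loan)
stats = describe(df, :mean, :median, :nmissing, cols = :LoanAmount)

# The statistics are ordinary columns of the result, one row per described column.
println(stats.mean[1])
println(stats.nmissing[1])
```

With the second approach the missing values are skipped when computing the statistics, while the nmissing column still reports how many there were.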