I'm trying basic data analysis with Julia. I'm following this tutorial with the train dataset that can be found here (the one named train_u6lujuX_CVtuZ9i.csv), using the following code:
using DataFrames, RDatasets, CSV, StatsBase
train = CSV.read("/Path/to/train_u6lujuX_CVtuZ9i.csv");
describe(train[:LoanAmount])
and get this output:
Summary Stats:
Length: 614
Type: Union{Missing, Int64}
Number Unique: 204
instead of the output of the tutorial:
Summary Stats:
Mean: 146.412162
Minimum: 9.000000
1st Quartile: 100.000000
Median: 128.000000
3rd Quartile: 168.000000
Maximum: 700.000000
Length: 592
Type: Int64
% Missing: 3.583062
which also corresponds to the output that the describe() function from StatsBase.jl should give.
This is how it is currently (in the current release) implemented in StatsBase.jl. In short, train.LoanAmount does not have an eltype that is a subtype of Real (its eltype is Union{Missing, Int64}), so StatsBase.jl uses a fallback method that only prints the length, eltype and number of unique values. You can write describe(collect(skipmissing(train.LoanAmount))) to get summary statistics (except the number of missing values, of course).
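To illustrate why the fallback is hit and what skipping the missing values changes, here is a minimal sketch that uses only the Statistics standard library. The vector loan_amount and its values are made up for illustration; it only stands in for train.LoanAmount.

```julia
using Statistics

# Hypothetical stand-in for train.LoanAmount (values invented for illustration)
loan_amount = Union{Missing, Int}[100, missing, 128, 168, missing, 700]

# The eltype includes Missing, so it is not a subtype of Real
println(eltype(loan_amount) <: Real)   # false

# Dropping the missing values yields a plain Vector{Int}
present = collect(skipmissing(loan_amount))
println(eltype(present) <: Real)       # true

# Statistics on the non-missing values
println("Mean:   ", mean(present))     # 274.0
println("Median: ", median(present))   # 148.0

# The missing count has to be computed on the original vector
n_missing = count(ismissing, loan_amount)
pct_missing = 100 * n_missing / length(loan_amount)
println("Missing: ", n_missing, " of ", length(loan_amount))
```

Since present no longer contains any missing values, anything computed from it cannot report how many were dropped, which is why the count is taken from the original vector.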
However, I would recommend another approach. If you want more verbose output on a single column, use:
describe(train, :all, cols=:LoanAmount)
You will get output that is additionally returned as a DataFrame, so that you can not only see the statistics but also access them.
The :all option prints all statistics; please refer to the describe docstring in DataFrames.jl to see the available options.
You can find some examples of using this function on a current release of DataFrames.jl here.
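As a rough sketch of accessing the returned statistics programmatically, the snippet below assumes a DataFrames.jl 1.x release, where describe accepts individual statistic names such as :mean and :nmissing. The data frame df and its values are hypothetical and stand in for train.

```julia
using DataFrames

# Hypothetical data standing in for the train data set
df = DataFrame(LoanAmount = [100, missing, 128, 168, missing, 700])

# Request a few specific statistics for one column;
# missing values are skipped when the statistics are computed
stats = describe(df, :mean, :min, :median, :max, :nmissing, cols=:LoanAmount)

# The result is itself a DataFrame with one row per described column
println(stats)

# Individual values can be read like any other DataFrame cell
loan_mean = stats.mean[1]
loan_nmissing = stats.nmissing[1]
println("Mean: ", loan_mean, ", missing: ", loan_nmissing)
```

Requesting named statistics instead of :all keeps the result narrow; which names are accepted depends on the installed DataFrames.jl version, so the docstring remains the reference.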