I would like to optimize (the columns in) a Julia DataFrame. To do so, I would like to get the size of the DataFrame before and after an optimization.
Here's an example DataFrame:
rows, columns = 10_000, 50
df = rand([x for x in "ABCDE"], rows, columns) |> DataFrame
The size of this df, as reported by sizeof(df):
The size is 24.
However, when I sum the sizes of the columns, the size is different...
sum([sizeof(df[x]) for x in names(df)])
The sum of the column sizes is 2000000.
Here's the optimization...
for i = names(df)
    df[i] = CategoricalArray(df[i], ordered=false)
end
Results are:
The size is 24.
sum([sizeof(df[x]) for x in names(df)])
The sum of the column sizes is 800.
Any suggestions on how to get an accurate size of a DataFrame?
Here is one way you can do it:
julia> df = DataFrame(rand([x for x in "ABCDE"], rows, columns), :auto);
julia> Base.summarysize(df)
julia> Base.summarysize(mapcols(PooledArray, df)) # this will change in the next release of PooledArrays.jl as the default size of refarray element will be UInt32
julia> Base.summarysize(mapcols(categorical, df))
julia> Base.summarysize(mapcols(x -> categorical(x, compress=true), df))
Note, though, that in this case the saving is not large, as all your columns have Char
element type. You would get much more benefit if you had columns holding long strings.
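To make the before/after comparison explicit, here is a minimal sketch that stores both measurements and prints their ratio. It assumes DataFrames.jl 1.x with CategoricalArrays.jl loaded separately; the variable names (before, after, compressed) are only illustrative. The point is that Base.summarysize follows references and so counts the column data, while sizeof only measures the DataFrame struct itself, which is why it stays constant.

```julia
using DataFrames, CategoricalArrays

rows, columns = 10_000, 50
df = DataFrame(rand(collect("ABCDE"), rows, columns), :auto)

# Base.summarysize walks everything reachable from df, including the columns.
before = Base.summarysize(df)

# mapcols returns a new DataFrame; df itself is left unchanged.
# compress=true asks for the smallest reference type that fits the levels.
compressed = mapcols(x -> categorical(x, compress=true), df)
after = Base.summarysize(compressed)

println("before: ", before, " bytes")
println("after:  ", after, " bytes")
println("ratio:  ", round(after / before, digits=2))
```

The exact byte counts depend on the package versions and on the platform, so it is safer to compare the two numbers than to rely on a particular value.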