I would like to optimize the columns in a Julia DataFrame. To do so, I would like to measure the size of the DataFrame before and after the optimization.
Here's an example DataFrame:
rows, columns = 10_000, 50
df = rand([x for x in "ABCDE"], rows, columns) |> DataFrame
The size of this df object:
sizeof(df)
The size is 24.
However, when I sum the sizes of the columns, the result is different:
sum([sizeof(df[x]) for x in names(df)])
The sum of the column sizes is 2000000.
Here's the optimization...
for i = names(df)
df[i] = CategoricalArray(df[i], ordered=false)
end
Results are:
sizeof(df)
The size is 24.
sum([sizeof(df[x]) for x in names(df)])
The sum of the column sizes is 800.
Any suggestions on how to get an accurate size of a DataFrame?
Here is a way you can do it:
julia> df = DataFrame(rand([x for x in "ABCDE"], rows, columns), :auto);
julia> Base.summarysize(df)
2007456
julia> Base.summarysize(mapcols(PooledArray, df)) # this will change in the next release of PooledArrays.jl as the default size of refarray element will be UInt32
525656
julia> Base.summarysize(mapcols(categorical, df))
2037256
julia> Base.summarysize(mapcols(x -> categorical(x, compress=true), df))
534856
Note though that in this case the savings are limited, as all your columns have the Char
element type, which takes only 4 bytes per element. You would get much more benefit if you had columns holding long strings.
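As for why sizeof(df) stays at 24 regardless of the contents: sizeof only measures an object's own fields, and does not follow the references those fields hold, whereas Base.summarysize walks everything reachable from the object. A minimal sketch of the difference using only Base, where ColumnHolder is a hypothetical stand-in for a table type and not part of DataFrames.jl:

```julia
# Hypothetical stand-in for a table type: a struct that only holds a
# reference to its data, much as a DataFrame holds references to its columns.
struct ColumnHolder
    data::Vector{Char}
end

holder = ColumnHolder(rand(collect("ABCDE"), 10_000))

# sizeof reports only the struct's own fields (here a single reference).
shallow = sizeof(holder)

# Base.summarysize follows references and also counts the reachable data.
deep = Base.summarysize(holder)

println("sizeof:      ", shallow)
println("summarysize: ", deep)
```

The same reasoning probably explains the column sums in the question: sizeof on a plain Vector{Char} counts its element data, while sizeof on a wrapper type such as CategoricalArray counts only the wrapper's fields.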