
How to get the memory size of a Julia DataFrame?


I would like to optimize (the columns in) a Julia DataFrame. To do so, I would like to get the size of the DataFrame before and after an optimization.

Here's an example DataFrame:

rows, columns = 10_000, 50
df = DataFrame(rand(collect("ABCDE"), rows, columns), :auto)

The size of this df object...

sizeof(df)

The size is 24.

However, when I sum the sizes of the columns, the size is different...

sum(sizeof(df[!, x]) for x in names(df))

The sum of the column sizes is 2000000.

Here's the optimization...

for i in names(df)
    # replace each column in place with a categorical version
    df[!, i] = CategoricalArray(df[!, i], ordered=false)
end

Results are:

sizeof(df)

The size is 24.

sum(sizeof(df[!, x]) for x in names(df))

The sum of the column sizes is 800.

Any suggestions on how to get an accurate size of a DataFrame?


Solution

  • Here is one way to do it:

    julia> using DataFrames, PooledArrays, CategoricalArrays
    
    julia> df = DataFrame(rand([x for x in "ABCDE"], rows, columns), :auto);
    
    julia> Base.summarysize(df)
    2007456
    
    julia> Base.summarysize(mapcols(PooledArray, df)) # this will change in the next release of PooledArrays.jl as the default size of refarray element will be UInt32
    525656
    
    julia> Base.summarysize(mapcols(categorical, df))
    2037256
    
    julia> Base.summarysize(mapcols(x -> categorical(x, compress=true), df))
    534856
    

    Note, though, that the savings are modest in this case, because all your columns have Char element type (4 bytes per value). You would benefit much more if the columns held long strings.
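
To see why sizeof and Base.summarysize disagree, and why long strings gain more from pooling than Char values, here is a rough sketch that uses only Base. The `Wrapper` struct and the hand-rolled `pool`/`refs` pair are hypothetical stand-ins made up for illustration; they are not how DataFrames.jl or PooledArrays.jl are actually implemented.

```julia
# Part 1: shallow vs. deep size.
# `Wrapper` is a hypothetical stand-in for a container that only holds
# a reference to its columns.
struct Wrapper
    columns::Vector{Vector{Char}}
end

cols = [rand(collect("ABCDE"), 10_000) for _ in 1:50]
w = Wrapper(cols)

shallow = sizeof(w)               # bytes of the struct's own fields (a single reference)
column_bytes = sum(sizeof, cols)  # raw element data: 50 columns * 10_000 Chars * 4 bytes
deep = Base.summarysize(w)        # everything reachable from w, each object counted once

println("sizeof(w)            = ", shallow)
println("sum of column sizeof = ", column_bytes)
println("Base.summarysize(w)  = ", deep)

# Part 2: hand-rolled pooling of long strings (illustration only).
# Each element is a freshly allocated 200-byte String.
strs = [repeat(rand(collect("ABCDE")), 200) for _ in 1:10_000]
pool = unique(strs)                                 # at most 5 distinct values
refs = UInt8[findfirst(==(s), pool) for s in strs]  # one byte per row

raw_bytes = Base.summarysize(strs)
pooled_bytes = Base.summarysize((pool, refs))

println("strings, unpooled    = ", raw_bytes)
println("strings, pooled      = ", pooled_bytes)
```

The first part mirrors the question: sizeof only sees the wrapper's own fields, while Base.summarysize walks the referenced arrays. The second part suggests why pooling pays off more for long strings: each row shrinks from a reference plus a 200-byte string to a single byte, whereas a Char column only goes from 4 bytes per value to 1.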