Find a subset of columns of a data frame that have some missing values


Given the following data frame from DataFrames.jl:

julia> using DataFrames

julia> df = DataFrame(x1=[1, 2, 3], x2=Union{Int,Missing}[1, 2, 3], x3=[1, 2, missing])
3×3 DataFrame
 Row │ x1     x2      x3
     │ Int64  Int64?  Int64?
─────┼────────────────────────
   1 │     1       1        1
   2 │     2       2        2
   3 │     3       3  missing

I would like to find the columns that contain missing values.

I have tried:

julia> names(df, Missing)
String[]

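but this does not give the result I want: when the names function is passed a type, it selects the columns whose element type is a subtype of that type, and neither Int64 nor Union{Missing, Int64} is a subtype of Missing.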
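
The empty result can be reproduced without a data frame by checking the subtype relation directly. The following is a minimal sketch using only Base Julia; the names T1, T2 and T3 are illustrative stand-ins for the element types of the three columns and are not part of DataFrames.jl.

```julia
# Illustrative stand-ins for the element types of x1, x2 and x3
T1, T2, T3 = Int, Union{Int,Missing}, Union{Int,Missing}

# A type selector keeps a column only if its element type is a subtype of the given type
matches = [T <: Missing for T in (T1, T2, T3)]   # all false, so no column is selected

# The relation holds only in the opposite direction
allows = [Missing <: T for T in (T1, T2, T3)]    # false, true, true
```

Only a column whose element type is exactly Missing (for example one holding nothing but missing values) would pass the first check.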


Solution

  • If you want to find the columns that actually contain a missing value, use:

    julia> names(df, any.(ismissing, eachcol(df)))
    1-element Vector{String}:
     "x3"
    

    In this approach we iterate over the columns of the df data frame and check whether each one contains at least one missing value.

  • If you want to find the columns that can potentially contain a missing value, check their element type instead:

    julia> names(df, [eltype(col) >: Missing for col in eachcol(df)]) # using a comprehension
    2-element Vector{String}:
     "x2"
     "x3"
    
    julia> names(df, .>:(eltype.(eachcol(df)), Missing)) # using broadcasting
    2-element Vector{String}:
     "x2"
     "x3"
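
The two checks answer different questions, so it may help to see them side by side. The sketch below wraps each one in a small helper; the names cols_with_missing and cols_allowing_missing are hypothetical and are not part of DataFrames.jl. It assumes the DataFrames package is installed.

```julia
using DataFrames

# Hypothetical helper: columns that hold at least one missing value
cols_with_missing(df) = names(df, any.(ismissing, eachcol(df)))

# Hypothetical helper: columns whose element type permits missing values
cols_allowing_missing(df) = names(df, [eltype(col) >: Missing for col in eachcol(df)])

df = DataFrame(x1=[1, 2, 3], x2=Union{Int,Missing}[1, 2, 3], x3=[1, 2, missing])

actual = cols_with_missing(df)         # ["x3"]
potential = cols_allowing_missing(df)  # ["x2", "x3"]

# Columns that allow missing values but currently contain none
unused = setdiff(potential, actual)    # ["x2"]
```

The first helper has to scan the data, while the second only inspects column types, so the second is the cheaper check when knowing that a column may hold missing values is enough.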