Search code examples
dataframejuliamissing-data

Remove rows with all missing values for columns that start with certain name in dataframe Julia


I have the following dataframe:

using DataFrames

df = DataFrame(
           group = ["A", "A", "A", "B", "B", "B"],
           V1 = [1, missing, missing, 3, missing, missing],
           V2 = [missing, missing, missing, 2, missing, missing],
           V3 = [missing, missing, 4, missing, 1, missing],
           Z1 = [3, missing, missing, 3, missing, missing],
           Z2 = [3, 1, 5, 2, missing, 3],
           Z3 = [missing, missing, 2, missing, missing, missing])

6×7 DataFrame
 Row │ group   V1       V2       V3       Z1       Z2       Z3      
     │ String  Int64?   Int64?   Int64?   Int64?   Int64?   Int64?  
─────┼──────────────────────────────────────────────────────────────
   1 │ A             1  missing  missing        3        3  missing 
   2 │ A       missing  missing  missing  missing        1  missing 
   3 │ A       missing  missing        4  missing        5        2
   4 │ B             3        2  missing        3        2  missing 
   5 │ B       missing  missing        1  missing  missing  missing 
   6 │ B       missing  missing  missing  missing        3  missing 

I would like to remove the rows with all values missing, but only where the columns start with "V" in their column names. This means that row 2 and 6 should be removed because they have all values missing across the columns that start with "V". The desired output should look like this:

4×7 DataFrame
 Row │ group   V1       V2       V3       Z1       Z2       Z3      
     │ String  Int64?   Int64?   Int64?   Int64?   Int64?   Int64?  
─────┼──────────────────────────────────────────────────────────────
   1 │ A             1  missing  missing        3        3  missing 
   2 │ A       missing  missing        4  missing        5        2
   3 │ B             3        2  missing        3        2  missing 
   4 │ B       missing  missing        1  missing  missing  missing 

So I was wondering if anyone knows how to remove rows where all values are missing across columns that start with certain column name in a dataframe Julia?


Solution

  • You can use the deleteat! function to drop the rows of the given data frame with the given indexes:

    deleteat!(df, all.(ismissing, eachrow(df[!, r"V"])))
    # 4×7 DataFrame
    #  Row │ group   V1       V2       V3       Z1       Z2       Z3
    #      │ String  Int64?   Int64?   Int64?   Int64?   Int64?   Int64?
    # ─────┼──────────────────────────────────────────────────────────────
    #    1 │ A             1  missing  missing        3        3  missing
    #    2 │ A       missing  missing        4  missing        5        2
    #    3 │ B             3        2  missing        3        2  missing
    #    4 │ B       missing  missing        1  missing  missing  missing
    

    Another way is following this approach (slicing by a mask):

    mask = map(x->!all(ismissing, x), eachrow(df[!, r"V.*"]))
    df[mask, :]
    # 4×7 DataFrame
    #  Row │ group   V1       V2       V3       Z1       Z2       Z3
    #      │ String  Int64?   Int64?   Int64?   Int64?   Int64?   Int64?
    # ─────┼──────────────────────────────────────────────────────────────
    #    1 │ A             1  missing  missing        3        3  missing
    #    2 │ A       missing  missing        4  missing        5        2
    #    3 │ B             3        2  missing        3        2  missing
    #    4 │ B       missing  missing        1  missing  missing  missing
    
    # Or
    mask = broadcast(~, all.(ismissing, eachrow(df[!, r"V"])))
    df[mask, :]
    
    # Or
    df[Not(all.(ismissing, eachrow(df[!, r"V"]))), :]
    

    The r"V.*" is a RegEx that is allowed for indexing by DataFrames.jl. Its interpretation:

    • V: Starts with the V letter.
    • .: Any char can appear.
    • .*: Any char(s) indefinitely can appear.

    The pattern could be r"^V" which catches any sequence of chars that starts with the V letter, or even an r"V" could be enough.
    Following this approach, another way is to create a mask DataFrame:

    maskdf = select(df, AsTable(r"V") => ByRow(x-> !all(ismissing, x)) => :mask)
    # 6×1 DataFrame
    #  Row │ mask
    #      │ Bool
    # ─────┼───────
    #    1 │  true
    #    2 │ false
    #    3 │  true
    #    4 │  true
    #    5 │  true
    #    6 │ false
    
    df[maskdf.mask, :]
    # returns the desired result.