Search code examples
replacejuliacomparisondataframes.jl

Julia DataFrames: Replace entries in a dataframe based on a comparison with another dataframe


I have the following dataframes:

df1 = DataFrame(
    col_A = [1, 2, 3, 4, 5, 6, 7],
    col_B = ["A", "B", "C", "D", "E", "F", "G"],
    col_C = missing,
)

7×3 DataFrame
 Row │ col_A  col_B   col_C   
     │ Int64  String  Missing 
─────┼────────────────────────
   1 │     1  "A"     missing 
   2 │     2  "B"     missing 
   3 │     3  "C"     missing 
   4 │     4  "D"     missing 
   5 │     5  "E"     missing 
   6 │     6  "F"     missing 
   7 │     7  "G"     missing

df2 = DataFrame(
    col_X = [1, 2, 3, 4, 5, 5],
    col_Y = ["A", "nope", "C", "nope", "E", "E"],
    col_Z = ["First", "Second", "Third", "Fourth", "Fifth", "Duplicated"]
)

6×3 DataFrame
 Row │ col_X  col_Y   col_Z      
     │ Int64  String  String     
─────┼───────────────────────────
   1 │     1  "A"     "First"
   2 │     2  "nope"  "Second"
   3 │     3  "C"     "Third"
   4 │     4  "nope"  "Fourth"
   5 │     5  "E"     "Fifth"
   6 │     5  "E"     "Duplicated"

I need to efficiently replace the values of df1.col_C with those of df2.col_Z if there's a match between -let's say- composite keys made from the first 2 columns in both dataframes (e.g. (1, "A") occurs in both, but (2, "B") doesn't), and otherwise keep things unchanged. If there were duplicated composite keys, get the last occurrence in df2.

So df1 would become:

7×3 DataFrame
 Row │ col_A  col_B   col_C      
     │ Int64  String  String?    
─────┼───────────────────────────
   1 │     1  "A"     "First"
   2 │     2  "B"     missing    
   3 │     3  "C"     "Third"
   4 │     4  "D"     missing    
   5 │     5  "E"     "Duplicated"
   6 │     6  "F"     missing    
   7 │     7  "G"     missing    

Solution

  • Is this satisfactory?

    julia> df1.col_C .= ifelse.(df1.col_A .== df2.col_X .&& 
                                df1.col_B .== df2.col_Y, 
                                df2.col_Z, missing)
    5-element Vector{Union{Missing, String}}:
     "First"
     missing
     missing
     missing
     "Fifth"
    

    Using Bogumil's answer, I think:

    mapping = Dict(zip(df2.col_X, df2.col_Y) .=> df2.col_Z)
    df1.col_C = [get(mapping, k, missing) 
      for k in zip(df1.col_A, df1.col_B)]
    

    will fix the un-aligned dfs problem you commented about.