Search code examples
juliajulia-dataframe

ArgumentError: No key column found Unstack Error?


I have a df like below,

Sample Input:

4×2 DataFrame
│ Row │ col1   │ col2  │
│     │ String │ Int64 │
├─────┼────────┼───────┤
│ 1   │ l1     │ 1     │
│ 2   │ l2     │ 2     │
│ 3   │ l1     │ 3     │
│ 4   │ l2     │ 4     │

I want to transform the above df to like below,

Expected df:

2×2 DataFrame
│ Row │ l1    │ l2    │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 2     │
│ 2   │ 3     │ 4     │

I tried unstack approach unstack(df, "col1", "col2") but I got ArgumentError: No key column found. I understood that without key unstack is not possible. How to transform the above df to expected df?


Solution

  • DataFrames.jl requires you to specify keys for rows to allow to perform matching them by these keys:

    julia> df = DataFrame(col1=["l1", "l2", "l1", "l2"], col2=1:4, rowkey=[1,1,2,2])
    4×3 DataFrame
    │ Row │ col1   │ col2  │ rowkey │
    │     │ String │ Int64 │ Int64  │
    ├─────┼────────┼───────┼────────┤
    │ 1   │ l1     │ 1     │ 1      │
    │ 2   │ l2     │ 2     │ 1      │
    │ 3   │ l1     │ 3     │ 2      │
    │ 4   │ l2     │ 4     │ 2      │
    
    julia> unstack(df, "col1", "col2")
    2×3 DataFrame
    │ Row │ rowkey │ l1     │ l2     │
    │     │ Int64  │ Int64? │ Int64? │
    ├─────┼────────┼────────┼────────┤
    │ 1   │ 1      │ 1      │ 2      │
    │ 2   │ 2      │ 3      │ 4      │
    

    Why? Imagine your data looks like this:

    julia> df = DataFrame(col1=["l1", "l2", "l1", "l2","l1"], col2=1:5, rowkey=[1,1,3,3,2])
    5×3 DataFrame
    │ Row │ col1   │ col2  │ rowkey │
    │     │ String │ Int64 │ Int64  │
    ├─────┼────────┼───────┼────────┤
    │ 1   │ l1     │ 1     │ 1      │
    │ 2   │ l2     │ 2     │ 1      │
    │ 3   │ l1     │ 3     │ 3      │
    │ 4   │ l2     │ 4     │ 3      │
    │ 5   │ l1     │ 5     │ 2      │
    
    julia> unstack(df, "col1", "col2")
    3×3 DataFrame
    │ Row │ rowkey │ l1     │ l2      │
    │     │ Int64  │ Int64? │ Int64?  │
    ├─────┼────────┼────────┼─────────┤
    │ 1   │ 1      │ 1      │ 2       │
    │ 2   │ 2      │ 5      │ missing │
    │ 3   │ 3      │ 3      │ 4       │
    

    without :rowkey it would be not possible to tell that you actually want the second row in :l2 to hold a missing value.

    This is how unstack works as it is a general function, so it uses row keys to perform matching. If you want something that assumes:

    • all groups are in the same orders
    • all groups have the same lengths

    write this:

    julia> DataFrame([first(sdf.col1) => sdf.col2 for sdf in groupby(df, :col1)])
    2×2 DataFrame
    │ Row │ l1    │ l2    │
    │     │ Int64 │ Int64 │
    ├─────┼───────┼───────┤
    │ 1   │ 1     │ 2     │
    │ 2   │ 3     │ 4     │
    

    or a bit more general pattern

    julia> DataFrame([key.col1 => sdf.col2 for (key, sdf) in pairs(groupby(df, :col1))])
    2×2 DataFrame
    │ Row │ l1    │ l2    │
    │     │ Int64 │ Int64 │
    ├─────┼───────┼───────┤
    │ 1   │ 1     │ 2     │
    │ 2   │ 3     │ 4     │