Search code examples
group-bytime-seriesjulialag

Create lag / lead time series with by groups in Julia?


I am wondering if there is an easy way to create a lag (or lead) of a time series variable in Julia according to a by group or condition? For example: I have a dataset of the following form

julia> df1 = DataFrame(var1=["a","a","a","a","b","b","b","b"],
                             var2=[0,1,2,3,0,1,2,3])
8×2 DataFrame
│ Row │ var1   │ var2  │
│     │ String │ Int64 │
├─────┼────────┼───────┤
│ 1   │ a      │ 0     │
│ 2   │ a      │ 1     │
│ 3   │ a      │ 2     │
│ 4   │ a      │ 3     │
│ 5   │ b      │ 0     │
│ 6   │ b      │ 1     │
│ 7   │ b      │ 2     │
│ 8   │ b      │ 3     │

And I want to create a variable lag2 that contains the values in var2 lagged by 2. However, this should be done grouped by var1 so that the first two observations in the 'b' group do not get the last two values of the 'a' group. Rather they should be set to missing or zero or some default value.

I have tried the following code which produces the following error.

julia> df2 = df1 |> @groupby(_.var1) |> @mutate(lag2 = lag(_.var2,2)) |> DataFrame

ERROR: MethodError: no method matching merge(::Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}}, ::NamedTuple{(:lag2,),Tuple{ShiftedArray{Int64,Missing,1,QueryOperators.GroupColumnArrayView{Int64,Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},:var2}}}})
Closest candidates are:
  merge(::NamedTuple{,T} where T<:Tuple, ::NamedTuple) at namedtuple.jl:245
  merge(::NamedTuple{an,T} where T<:Tuple, ::NamedTuple{bn,T} where T<:Tuple) where {an, bn} at namedtuple.jl:233
  merge(::NamedTuple, ::NamedTuple, ::NamedTuple...) at namedtuple.jl:249
  ...
Stacktrace:
 [1] (::var"#437#442")(::Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}}) at /Users/kayvon/.julia/packages/Query/AwBtd/src/query_translation.jl:58
 [2] iterate at /Users/kayvon/.julia/packages/QueryOperators/g4G21/src/enumerable/enumerable_map.jl:25 [inlined]
 [3] iterate at /Users/kayvon/.julia/packages/Tables/TjjiP/src/tofromdatavalues.jl:45 [inlined]
 [4] buildcolumns at /Users/kayvon/.julia/packages/Tables/TjjiP/src/fallbacks.jl:185 [inlined]
 [5] columns at /Users/kayvon/.julia/packages/Tables/TjjiP/src/fallbacks.jl:237 [inlined]
 [6] #DataFrame#453(::Bool, ::Type{DataFrame}, ::QueryOperators.EnumerableMap{Union{},QueryOperators.EnumerableIterable{Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},QueryOperators.EnumerableGroupBy{Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},String,NamedTuple{(:var1, :var2),Tuple{String,Int64}},QueryOperators.EnumerableIterable{NamedTuple{(:var1, :var2),Tuple{String,Int64}},Tables.DataValueRowIterator{NamedTuple{(:var1, :var2),Tuple{String,Int64}},Tables.Schema{(:var1, :var2),Tuple{String,Int64}},Tables.RowIterator{NamedTuple{(:var1, :var2),Tuple{Array{String,1},Array{Int64,1}}}}}},var"#434#439",var"#435#440"}},var"#437#442"}) at /Users/kayvon/.julia/packages/DataFrames/S3ZFo/src/other/tables.jl:40
 [7] DataFrame(::QueryOperators.EnumerableMap{Union{},QueryOperators.EnumerableIterable{Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},QueryOperators.EnumerableGroupBy{Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},String,NamedTuple{(:var1, :var2),Tuple{String,Int64}},QueryOperators.EnumerableIterable{NamedTuple{(:var1, :var2),Tuple{String,Int64}},Tables.DataValueRowIterator{NamedTuple{(:var1, :var2),Tuple{String,Int64}},Tables.Schema{(:var1, :var2),Tuple{String,Int64}},Tables.RowIterator{NamedTuple{(:var1, :var2),Tuple{Array{String,1},Array{Int64,1}}}}}},var"#434#439",var"#435#440"}},var"#437#442"}) at /Users/kayvon/.julia/packages/DataFrames/S3ZFo/src/other/tables.jl:31
 [8] |>(::QueryOperators.EnumerableMap{Union{},QueryOperators.EnumerableIterable{Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},QueryOperators.EnumerableGroupBy{Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},String,NamedTuple{(:var1, :var2),Tuple{String,Int64}},QueryOperators.EnumerableIterable{NamedTuple{(:var1, :var2),Tuple{String,Int64}},Tables.DataValueRowIterator{NamedTuple{(:var1, :var2),Tuple{String,Int64}},Tables.Schema{(:var1, :var2),Tuple{String,Int64}},Tables.RowIterator{NamedTuple{(:var1, :var2),Tuple{Array{String,1},Array{Int64,1}}}}}},var"#434#439",var"#435#440"}},var"#437#442"}, ::Type) at ./operators.jl:854
 [9] top-level scope at none:0

Appreciate any help with this approach or alternate approaches. Thanks.


Solution

  • EDIT

    Putting this edit to the top as it works in DataFrames 1.0 so reflects the stable API:

    Under DataFrames.jl 0.22.2 the correct syntax is:

    julia> combine(groupby(df1, :var1), :var2 => Base.Fix2(lag, 2) => :var2_l2)
    8×2 DataFrame
     Row │ var1    var2_l2 
         │ String  Int64?  
    ─────┼─────────────────
       1 │ a       missing 
       2 │ a       missing 
       3 │ a             0
       4 │ a             1
       5 │ b       missing 
       6 │ b       missing 
       7 │ b             0
       8 │ b             1
    

    Another alternative to the maybe slightly arcane Base.Fix2 syntax you could use an anonymous function (x -> lag(x, 2)) (note the enclosing parens are required due to operator precedence).


    Original answer:

    You definitely had the right idea - I don't work with Query.jl but this can easily be done with basic DataFrames syntax:

    julia> using DataFrames
    
    julia> import ShiftedArrays: lag
    
    julia> df1 = DataFrame(var1=["a","a","a","a","b","b","b","b"],
                                        var2=[0,1,2,3,0,1,2,3]);
    
    julia> by(df1, :var1, var2_l2 = :var2 => Base.Fix2(lag, 2)))
    8×2 DataFrame
    │ Row │ var1   │ var2_l2 │
    │     │ String │ Int64⍰  │
    ├─────┼────────┼─────────┤
    │ 1   │ a      │ missing │
    │ 2   │ a      │ missing │
    │ 3   │ a      │ 0       │
    │ 4   │ a      │ 1       │
    │ 5   │ b      │ missing │
    │ 6   │ b      │ missing │
    │ 7   │ b      │ 0       │
    │ 8   │ b      │ 1       │
    

    Note that I used Base.Fix2 here to get a single argument version of lag. This is essentially the same as defining your own l2(x) = lag(x, 2) and then using l2 in the by call. If you do define your own l2 function you can also set the default value like l2(x) = lag(x, 2, default = -1000) if you want to avoid missing values:

    julia> l2(x) = lag(x, 2, default = -1000)
    l2 (generic function with 1 method)
    
    julia> by(df1, :var1, var2_l2 = :var2 => l2)
    8×2 DataFrame
    │ Row │ var1   │ var2_l2 │
    │     │ String │ Int64   │
    ├─────┼────────┼─────────┤
    │ 1   │ a      │ -1000   │
    │ 2   │ a      │ -1000   │
    │ 3   │ a      │ 0       │
    │ 4   │ a      │ 1       │
    │ 5   │ b      │ -1000   │
    │ 6   │ b      │ -1000   │
    │ 7   │ b      │ 0       │
    │ 8   │ b      │ 1       │