Search code examples
statisticsjuliaregressionglm

How can I get a consistent return type from this function?


Is there any way to get the function below to return a consistent type? I'm doing some work with Julia GLM (love it). I wrote a function that creates all of the possible regression combinations for a dataset. However, my current method of creating a @formula returns a different type for every different length of rhs.

using GLM

function compose(lhs::Symbol, rhs::AbstractVector{Symbol})
    ts = term.((1, rhs...))
    term(lhs) ~ sum(ts)
end

Using @code_warntype for a simple example returns the following

julia> @code_warntype compose(:y, [:x])
Variables
  #self#::Core.Compiler.Const(compose, false)
  lhs::Symbol
  rhs::Array{Symbol,1}
  ts::Any

Body::FormulaTerm{Term,_A} where _A
1 ─ %1 = Core.tuple(1)::Core.Compiler.Const((1,), false)
│   %2 = Core._apply(Core.tuple, %1, rhs)::Core.Compiler.PartialStruct(Tuple{Int64,Vararg{Symbol,N} where N}, Any[Core.Compiler.Const(1, false), Vararg{Symbol,N} where N])
│   %3 = Base.broadcasted(Main.term, %2)::Base.Broadcast.Broadcasted{Base.Broadcast.Style{Tuple},Nothing,typeof(term),_A} where _A<:Tuple
│        (ts = Base.materialize(%3))
│   %5 = Main.term(lhs)::Term
│   %6 = Main.sum(ts)::Any
│   %7 = (%5 ~ %6)::FormulaTerm{Term,_A} where _A
└──      return %7

And checking the return type of a few different inputs:

julia> compose(:y, [:x]) |> typeof
FormulaTerm{Term,Tuple{ConstantTerm{Int64},Term}}

julia> compose(:y, [:x1, :x2]) |> typeof
FormulaTerm{Term,Tuple{ConstantTerm{Int64},Term,Term}}

We see that as the length of rhs changes, the return type changes.

Can I change my compose function so that it always returns the same type? This isn't really a big issue. Compiling for each new number of regressors only takes ~70ms. This is really more of a "how can I improve my Julia skills?"


Solution

  • I do not think you can avoid type unstability here as ~ expects RHS to be a Term or a Tuple of Terms.

    However, the most compilation cost you are paying is in term.((1, rhs...)) as you invoke broadcasting which is expensive to compile. Here is how you can do it in a cheaper way:

    function compose(lhs::Symbol, rhs::AbstractVector{Symbol})
        term(lhs) ~ ntuple(i -> i <= length(rhs) ? term(rhs[i]) : term(1) , length(rhs)+1)
    end
    

    or (this is a bit slower but more like your original code):

    function compose(lhs::Symbol, rhs::AbstractVector{Symbol})
        term(lhs) ~ map(term, (1, rhs...))
    end
    

    Finally - if you are doing such computations maybe you can drop using the formula interface but feed to lm or glm directly matrices as RHS in which case it should be able to avoid extra compilation cost, e.g.:

    julia> y = rand(10);
    
    julia> x = rand(10, 2);
    
    julia> @time lm(x,y);
      0.000048 seconds (18 allocations: 1.688 KiB)
    
    julia> x = rand(10, 3);
    
    julia> @time lm(x,y);
      0.000038 seconds (18 allocations: 2.016 KiB)
    
    julia> y = rand(100);
    
    julia> x = rand(100, 50);
    
    julia> @time lm(x,y);
      0.000263 seconds (22 allocations: 121.172 KiB)