Julia: What is the perfect way to convert a categorical array to a numeric array?


What is the perfect way to convert a categorical array to a simple numeric array? For example:

using CategoricalArrays
a = CategoricalArray(["X", "X", "Y", "Z", "Y", "Y", "Z"])
b = recode(a, "X"=>1, "Y"=>2, "Z"=>3)

As a result of the conversion, we still get a categorical array, even if we explicitly specify the type of assigned values:

b = recode(a, "X"=>1::Int64, "Y"=>2::Int64, "Z"=>3::Int64)

It looks like a different approach is needed here, but I can't think of where to look.


Solution

  • You have two natural options:

    julia> recode(unwrap.(a), "X"=>1, "Y"=>2, "Z"=>3)
    7-element Vector{Int64}:
     1
     1
     2
     3
     2
     2
     3
    

    or

    julia> mapping = Dict("X"=>1, "Y"=>2, "Z"=>3)
    Dict{String, Int64} with 3 entries:
      "Y" => 2
      "Z" => 3
      "X" => 1
    
    julia> [mapping[v] for v in a]
    7-element Vector{Int64}:
     1
     1
     2
     3
     2
     2
     3
    

    The Dict approach is slower, but it is more flexible if you have many levels to map.
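    As a rough sketch of that flexibility, the mapping can be built programmatically instead of being typed out pair by pair. The names labels and codes, and the code values 10, 20, 30, are hypothetical and only serve the illustration:

```julia
using CategoricalArrays

a = CategoricalArray(["X", "X", "Y", "Z", "Y", "Y", "Z"])

# Hypothetical inputs: in practice these could come from a file or a lookup table.
labels = ["X", "Y", "Z"]
codes = [10, 20, 30]

# Build the Dict from the two parallel vectors.
mapping = Dict(zip(labels, codes))

# Same lookup as above: index the Dict with each element of the categorical array.
b = [mapping[v] for v in a]
```

    Here b is a plain vector of integers, not a CategoricalArray.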

    The key function here is unwrap, which drops the "categorical" wrapper of a CategoricalValue and returns the underlying value (in the Dict approach you do not have to call unwrap yourself; it is handled automatically).
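    To make the effect of unwrap concrete, here is a small sketch that inspects a single element and the whole array. The variable names v, raw and plain are chosen only for this illustration:

```julia
using CategoricalArrays

a = CategoricalArray(["X", "X", "Y", "Z", "Y", "Y", "Z"])

v = a[1]            # indexing returns a CategoricalValue wrapping "X"
raw = unwrap(v)     # the underlying String
plain = unwrap.(a)  # broadcasting unwrap gives an ordinary Vector{String}
```

    Because plain is an ordinary vector, recode applied to it returns an ordinary vector as well, which is why the first option above yields a Vector{Int64}.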

    Also note that if you just want the level codes of the values stored in a CategoricalArray (something that R does by default), you can simply do:

    julia> levelcode.(a)
    7-element Vector{Int64}:
     1
     1
     2
     3
     2
     2
     3
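
    Keep in mind that the codes returned by levelcode are positions in the array's levels, so they change if the levels are reordered. A minimal sketch, assuming you reorder the levels with levels! (the reversed order used here is just an example):

```julia
using CategoricalArrays

a = CategoricalArray(["X", "X", "Y", "Z", "Y", "Y", "Z"])

# With the default (sorted) levels, "X" => 1, "Y" => 2, "Z" => 3.
default_codes = levelcode.(a)

# Reorder the levels in place; the stored values stay the same.
levels!(a, ["Z", "Y", "X"])

# Now "Z" => 1, "Y" => 2, "X" => 3.
reordered_codes = levelcode.(a)
```

    If you need a fixed mapping that does not depend on the level order, the recode or Dict approaches above are the safer choice.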
    

    Note also that levelcode maps missing to missing:

    julia> x = CategoricalArray(["Y", "X", missing, "Z"])
    4-element CategoricalArray{Union{Missing, String},1,UInt32}:
     "Y"
     "X"
     missing
     "Z"
    
    julia> levelcode.(x)
    4-element Vector{Union{Missing, Int64}}:
     2
     1
      missing
     3
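
    If you need a vector without missing in its element type, one possible follow-up is to replace or drop the missing codes with functions from Base. This is only a sketch; the sentinel value 0 is an arbitrary choice for illustration, not something the package prescribes:

```julia
using CategoricalArrays

x = CategoricalArray(["Y", "X", missing, "Z"])

codes = levelcode.(x)                # element type allows missing

# Option 1: replace missing with a sentinel (0 is a hypothetical choice).
filled = coalesce.(codes, 0)

# Option 2: drop the missing entries altogether.
dropped = collect(skipmissing(codes))
```

    Both filled and dropped hold only integers, so they can be passed to code that does not accept missing values.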