Julia & Avro.jl : Issue with tuples


I am trying to get Avro working in Julia and am having some real issues. My application needs a row-oriented data format to which I can append hierarchical records row by row as they are generated.

Avro seems like a good fit, but I am having issues in Julia. I have things working in a Python test, but I need this to work in Julia because the main code is in Julia.

Here are my simplified test examples, which show the issue. The first one works, the second gives the wrong answer, and the rest give errors. Any help would be appreciated.

import Avro
v1=Dict("RUTHERFORD"  => 7, "DURHAM" => 11)
buf=Avro.write(v1)
Avro.read(buf,typeof(v1))

output:

Dict{String, Int64} with 2 entries:
  "DURHAM"     => 11
  "RUTHERFORD" => 7

example 2:

@show v3=Dict((5,2)  => 7, (5,4) => 11)
@show typeof(v3)
buf=Avro.write(v3)
Avro.read(buf,typeof(v3))

output:

v3 = Dict((5, 2) => 7, (5, 4) => 11) = Dict((5, 2) => 7, (5, 4) => 11)
typeof(v3) = Dict{Tuple{Int64, Int64}, Int64}
Dict{Tuple{Int64, Int64}, Int64} with 1 entry:
  (40, 53) => 11

example 3:

@show v2=Dict(("jcm",2)  => 7, ("sem",4) => 11)
@show typeof(v2)
buf=Avro.write(v2)
v2o=Avro.read(buf,typeof(v2))

output:

v2 = Dict(("jcm", 2) => 7, ("sem", 4) => 11) = Dict(("sem", 4) => 11, ("jcm", 2) => 7)
typeof(v2) = Dict{Tuple{String, Int64}, Int64}
MethodError: Cannot `convert` an object of type Char to an object of type String
Closest candidates are:
  convert(::Type{String}, ::String) at essentials.jl:210
  convert(::Type{T}, ::T) where T<:AbstractString at strings/basic.jl:231
  convert(::Type{T}, ::AbstractString) where T<:AbstractString at strings/basic.jl:232
  ...

Stacktrace:
  [1] _totuple
    @ ./tuple.jl:316 [inlined]
  [2] Tuple{String, Int64}(itr::String)
    @ Base ./tuple.jl:303
  [3] construct(T::Type, args::String; kw::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ StructTypes ~/.julia/packages/StructTypes/NJXhA/src/StructTypes.jl:310
  [4] construct(T::Type, args::String)
    @ StructTypes ~/.julia/packages/StructTypes/NJXhA/src/StructTypes.jl:310
  [5] construct(::Type{Tuple{String, Int64}}, ptr::Ptr{UInt8}, len::Int64; kw::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ StructTypes ~/.julia/packages/StructTypes/NJXhA/src/StructTypes.jl:435
  [6] construct(::Type{Tuple{String, Int64}}, ptr::Ptr{UInt8}, len::Int64)
    @ StructTypes ~/.julia/packages/StructTypes/NJXhA/src/StructTypes.jl:435
  [7] readvalue(B::Avro.Binary, #unused#::Avro.StringType, #unused#::Type{Tuple{String, Int64}}, buf::Vector{UInt8}, pos::Int64, len::Int64, opts::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ Avro ~/.julia/packages/Avro/JEoRa/src/types/binary.jl:247
  [8] readvalue(B::Avro.Binary, MT::Avro.MapType, #unused#::Type{Dict{Tuple{String, Int64}, Int64}}, buf::Vector{UInt8}, pos::Int64, buflen::Int64, opts::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ Avro ~/.julia/packages/Avro/JEoRa/src/types/maps.jl:63
  [9] read(buf::Vector{UInt8}, ::Type{Dict{Tuple{String, Int64}, Int64}}; schema::Avro.MapType, jsonencoding::Bool, kw::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ Avro ~/.julia/packages/Avro/JEoRa/src/types/binary.jl:58
 [10] read(buf::Vector{UInt8}, ::Type{Dict{Tuple{String, Int64}, Int64}})
    @ Avro ~/.julia/packages/Avro/JEoRa/src/types/binary.jl:58
 [11] top-level scope
    @ In[209]:5
 [12] eval
    @ ./boot.jl:360 [inlined]
 [13] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
    @ Base ./loading.jl:1094

Last example:

v=Dict(("RUTHERFORD", "05A", "371619611022065")   => 7, ("DURHAM", "28","jcm") => 11)
buf=Avro.write(v)
vo=Avro.read(buf,typeof(v))

output:

MethodError: Cannot `convert` an object of type Char to an object of type String
Closest candidates are:
  convert(::Type{String}, ::String) at essentials.jl:210
  convert(::Type{T}, ::T) where T<:AbstractString at strings/basic.jl:231
  convert(::Type{T}, ::AbstractString) where T<:AbstractString at strings/basic.jl:232
  ...

Stacktrace:
  [1] _totuple
    @ ./tuple.jl:316 [inlined]
  [2] Tuple{String, String, String}(itr::String)
    @ Base ./tuple.jl:303
  [3] construct(T::Type, args::String; kw::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ StructTypes ~/.julia/packages/StructTypes/NJXhA/src/StructTypes.jl:310
  [4] construct(T::Type, args::String)
    @ StructTypes ~/.julia/packages/StructTypes/NJXhA/src/StructTypes.jl:310
  [5] construct(::Type{Tuple{String, String, String}}, ptr::Ptr{UInt8}, len::Int64; kw::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ StructTypes ~/.julia/packages/StructTypes/NJXhA/src/StructTypes.jl:435
  [6] construct(::Type{Tuple{String, String, String}}, ptr::Ptr{UInt8}, len::Int64)
    @ StructTypes ~/.julia/packages/StructTypes/NJXhA/src/StructTypes.jl:435
  [7] readvalue(B::Avro.Binary, #unused#::Avro.StringType, #unused#::Type{Tuple{String, String, String}}, buf::Vector{UInt8}, pos::Int64, len::Int64, opts::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ Avro ~/.julia/packages/Avro/JEoRa/src/types/binary.jl:247
  [8] readvalue(B::Avro.Binary, MT::Avro.MapType, #unused#::Type{Dict{Tuple{String, String, String}, Int64}}, buf::Vector{UInt8}, pos::Int64, buflen::Int64, opts::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ Avro ~/.julia/packages/Avro/JEoRa/src/types/maps.jl:63
  [9] read(buf::Vector{UInt8}, ::Type{Dict{Tuple{String, String, String}, Int64}}; schema::Avro.MapType, jsonencoding::Bool, kw::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ Avro ~/.julia/packages/Avro/JEoRa/src/types/binary.jl:58
 [10] read(buf::Vector{UInt8}, ::Type{Dict{Tuple{String, String, String}, Int64}})
    @ Avro ~/.julia/packages/Avro/JEoRa/src/types/binary.jl:58
 [11] top-level scope
    @ In[210]:3
 [12] eval
    @ ./boot.jl:360 [inlined]
 [13] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
    @ Base ./loading.jl:1094

Solution

  • What is going wrong?

    Avro.jl is unable to properly read from the buffer into a Dict (or, as Avro calls it, a "Map") that uses a Tuple as a key because, according to the Avro specification:

    Map keys are assumed to be strings.

    This assumption is hard-coded into Avro.jl: no matter what the actual type of the Dict keys is, the code treats each key as a String. Avro.jl does not check that the key type is actually a subtype of AbstractString; as long as the key can be converted to a String via the Base.string method, the code writes that string representation to the buffer. That is exactly what happens when you write a Dict with Tuple keys:

    v = Dict((1,2) => 3)
    buf = Avro.write(v)
    Char.(buf)
    

    This decodes the bytes in buf as ASCII/Unicode characters and prints them to the REPL. You should see the string representation of the Tuple (1,2) in there encoded as "(1, 2)":

    11-element Vector{Char}:
     '\x01': ASCII/Unicode U+0001 (category Cc: Other, control)
     '\x10': ASCII/Unicode U+0010 (category Cc: Other, control)
     '\f': ASCII/Unicode U+000C (category Cc: Other, control)
     '(': ASCII/Unicode U+0028 (category Ps: Punctuation, open)
     '1': ASCII/Unicode U+0031 (category Nd: Number, decimal digit)
     ',': ASCII/Unicode U+002C (category Po: Punctuation, other)
     ' ': ASCII/Unicode U+0020 (category Zs: Separator, space)
     '2': ASCII/Unicode U+0032 (category Nd: Number, decimal digit)
     ')': ASCII/Unicode U+0029 (category Pe: Punctuation, close)
     '\x06': ASCII/Unicode U+0006 (category Cc: Other, control)
     '\0': ASCII/Unicode U+0000 (category Cc: Other, control)
    

    The problem arises when you try to read that key back into a Tuple. When reading the key of a Map element, Avro.jl reads whatever is in the buffer as a String and passes it to the constructor of the key type (this is the Tuple{...}(itr::String) frame in your stack traces). Julia's Tuple constructor accepts any iterable, and iterating a String yields its characters. If the type is a Tuple of N types that can be converted from a Char (such as Int64), then the first N characters of the string are used to create the key:

    Avro.read(buf, typeof(v))
    # Dict{Tuple{Int64, Int64}, Int64} with 1 entry:
    #   (40, 49) => 3
    

    Why 40 and 49? Because those are the Int64 representations of the Chars '(' and '1', respectively:

    Char(40)
    # '(': ASCII/Unicode U+0028 (category Ps: Punctuation, open)
    Char(49)
    # '1': ASCII/Unicode U+0031 (category Nd: Number, decimal digit)
    

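    Note that this is why your second example reads back only one element even though two were written. Only the first two characters of each key's string representation are used, and those are '(' and '5' for both keys in your example, so both keys become (40, 53). A Dict cannot have duplicate keys, so the second value simply overwrites the first. The same mechanism produces the MethodError in your third and last examples: the first element type of the Tuple is String, and the first character '(' is a Char that cannot be converted to a String.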
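The collapse can be reproduced without Avro.jl at all. The following is a minimal sketch, assuming (as the stack traces in the question suggest) that the serialized key reaches the Tuple constructor as a plain String. The key strings are written out by hand here rather than read from a buffer.

```julia
# Serialized forms of the two keys from the second example, written by hand.
serialized = ["(5, 2)" => 7, "(5, 4)" => 11]

KeyType = Tuple{Int64, Int64}
restored = Dict{KeyType, Int64}()
for (keystring, value) in serialized
    # The Tuple constructor iterates the String and converts only the
    # first two characters, '(' and '5', to Int64.
    key = KeyType(keystring)
    restored[key] = value
end

@show restored    # a single entry: (40, 53) => 11

# With a String element type the same call fails, because a Char
# cannot be converted to a String.
failed = try
    Tuple{String, Int64}("(\"jcm\", 2)")
    false
catch err
    err isa MethodError
end
@show failed
```

Both key strings start with the same two characters, so the second assignment replaces the first, which matches the output in the question.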

    How to fix it

    Avoid using non-strings as keys

    Because the Avro specification states that Map keys are assumed to be strings, you should probably follow it and avoid using non-strings as keys. In my opinion, Avro.jl should not let the user write a Dict whose keys are not subtypes of AbstractString. Maybe that's a design choice, or maybe that's a bug, but it might be worth filing an issue on the project page just in case.
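One way to stay within the specification is to turn the Tuple keys into strings yourself before writing and to parse them back after reading. The sketch below is only an illustration: `encode_key`, `decode_key`, and the "|" separator are hypothetical names and choices, not part of Avro.jl, and the Avro calls are left as comments so the snippet runs on its own.

```julia
# Hypothetical helpers: map a Tuple{Int64,Int64} key to a String and back.
encode_key(t::Tuple{Int64, Int64}) = string(t[1], "|", t[2])

function decode_key(s::AbstractString)
    a, b = split(s, "|")
    return (parse(Int64, a), parse(Int64, b))
end

v3 = Dict((5, 2) => 7, (5, 4) => 11)

# Dict{String, Int64}: this is what would be handed to Avro.
encoded = Dict(encode_key(k) => val for (k, val) in v3)

# buf = Avro.write(encoded)
# encoded = Avro.read(buf, typeof(encoded))

# Rebuild the Tuple-keyed Dict after reading.
decoded = Dict(decode_key(k) => val for (k, val) in encoded)
@show decoded == v3
```

The Dict that reaches Avro then has String keys, like the first example in the question, which is the case that already works.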

    Use a custom type as a key

    If you really, really want to use something other than a String as a key, Avro.jl will always convert the key to a String when it serializes a Map to a buffer using the Base.string method. During deserialization, if the code recognizes the key as a struct, it will try to pass the serialized String to the struct's constructor. Therefore all you have to do is define a custom struct with a constructor that takes a String and make it do the right thing (and optionally overload the Base.string method). Here's an example:

    struct XY
        x::Int64
        y::Int64
    end
    function XY(s::String)
        # parse the default string representation of an XY value
        # very inefficient: for demonstration purposes only
        m = match(r"XY\((\d+), (\d+)\)", s)
        XY(parse.(Int64, m.captures)...)
    end
    
    v2 = Dict(XY(1,2) => 3)
    buf2 = Avro.write(v2)
    Avro.read(buf2, typeof(v2))
    # Dict{XY, Int64} with 1 entry:
    #   XY(1, 2) => 3
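If you also overload Base.string, the string form and the parsing constructor can be kept as a matched pair. This is a sketch under the assumption described above, namely that Avro.jl serializes the key with Base.string and hands the stored String to a one-argument constructor. `GridKey` and its "x,y" format are hypothetical, and the round trip is checked here without Avro.

```julia
# Hypothetical key type; the name and the "x,y" format are illustrative only.
struct GridKey
    x::Int64
    y::Int64
end

# Compact string form used in place of the default "GridKey(1, 2)".
Base.string(k::GridKey) = string(k.x, ",", k.y)

# Inverse of the method above: rebuild the key from its string form.
function GridKey(s::AbstractString)
    parts = split(s, ",")
    length(parts) == 2 || throw(ArgumentError("expected \"x,y\", got \"$s\""))
    return GridKey(parse(Int64, parts[1]), parse(Int64, parts[2]))
end

key = GridKey(1, 2)
@show string(key)
@show GridKey(string(key)) == key
```

Keeping the two methods next to each other makes it harder for the written format and the parser to drift apart.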
    

    Write your own Tuple construct method

    If you really, really, really want to use a Tuple as a key, you can take advantage of StructTypes.StringType and define your own StructTypes.construct method. Because Avro.jl calls the unsafe pointer version, you're stuck defining that same signature for your Tuple. Here is an awkward example:

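    import StructTypes   # must be available in your environment

    # Note: this adds a method for types you do not own (type piracy).
    function StructTypes.construct(::Type{Tuple{Int64,Int64}}, ptr::Ptr{UInt8}, len::Int; kw...)
        s = unsafe_string(ptr, len)      # copy the serialized key, e.g. "(1, 2)"
        m = findall(r"-?\d+", s)         # index ranges of the integers in s
        (parse(Int64, s[m[1]]), parse(Int64, s[m[2]]))
    end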
    Avro.read(buf, typeof(v))
    # Dict{Tuple{Int64, Int64}, Int64} with 1 entry:
    #   (1, 2) => 3
    

    For the curious: why does Avro.jl get the value right, even if the key is parsed incorrectly?

    In Avro's binary encoding scheme, strings are serialized with their length stored in front of the string data. This allows Avro.jl to pass the known length of the string key to the pointer-based StructTypes.construct method, which (as the stack traces in the question show) turns those bytes into a String and passes it to the Tuple constructor. A fun fact about Julia is that the iterable-based constructor for a Tuple reads only as many elements from the iterable as it needs to construct the Tuple, then stops. Example:

    Tuple{Int64, Int64}([1,2,3,4])
    # (1, 2)
    

    So Avro.jl passes the 6-character String "(1, 2)" to the constructor of Tuple{Int64,Int64}, which reads only the first two characters and returns the Tuple for Avro.jl to use as the key of the Map element. Avro.jl then skips ahead to where it knows the string ends (remember: the length of the string is stored in the buffer) and starts reading there for the value of the Map element. Avro.jl knows that the value should be an Int64, and it knows how to parse an Int64, so it reads the appropriate value. Neat!
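The length prefix can be seen by decoding the 11 bytes from the earlier Char.(buf) listing by hand. This is a minimal sketch based on my reading of the Avro binary encoding (integers are zigzag varints, and a negative block count is followed by the block's size in bytes). `read_long` is a hypothetical helper, not an Avro.jl function.

```julia
# Bytes copied from the Char.(buf) listing for Dict((1,2) => 3).
buf = UInt8[0x01, 0x10, 0x0c, 0x28, 0x31, 0x2c, 0x20, 0x32, 0x29, 0x06, 0x00]

# Hypothetical helper: read one zigzag-encoded, variable-length integer
# starting at `pos`; return the value and the position after it.
function read_long(buf, pos)
    raw = UInt64(0)
    shift = 0
    while true
        b = buf[pos]
        pos += 1
        raw |= UInt64(b & 0x7f) << shift
        (b & 0x80) == 0 && break      # high bit clear: last byte of the varint
        shift += 7
    end
    value = Int64(raw >>> 1) ⊻ -Int64(raw & 1)   # undo the zigzag mapping
    return value, pos
end

nitems, pos = read_long(buf, 1)       # -1: one entry, block byte size follows
blocksize, pos = read_long(buf, pos)  # 8 bytes in this block
keylen, pos = read_long(buf, pos)     # 6: length of the key string
key = String(buf[pos:pos + keylen - 1])
pos += keylen                         # skip past the whole key string
value, pos = read_long(buf, pos)      # the Map value
endmarker = buf[pos]                  # 0x00 terminates the Map

@show nitems blocksize keylen key value endmarker
```

Because `pos` advances by `keylen` regardless of how the key text is interpreted, the value is read from the right place even when the key is parsed incorrectly.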