Tags: julia, binaryfiles

How to write/read array of strings to .bin with good performance


Writing an array of strings to a .bin file is done as follows:

    out = open("string_array.bin","w")
    a = ["first string","second string","third string"]
    write(out,a)
    close(out)

Reading the array a back, however, is where things get tricky:

    out = open("string_array.bin","r")
    a = read(out)
    close(out)
    typeof(a) # returns Array{UInt8,1}

How does one convert the Array{UInt8,1} back to the original array a of type Array{String,1}?

The solution also has to perform well when the array of strings has 300+ million elements.
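
One way to see why the round trip is hard: strings written back to back are stored as raw bytes only, with no record of where one string ends and the next begins. The sketch below is illustrative, not the asker's exact code. It assumes each string is written with its own `write` call (depending on the Julia version, calling `write` directly on a `Vector{String}` may not be supported), and `path` is a hypothetical scratch file.

```julia
a = ["first string", "second string", "third string"]
path = tempname()                  # hypothetical scratch file for this sketch

# Write the strings back to back, one `write` call per string.
open(path, "w") do io
    for s in a
        write(io, s)
    end
end

bytes = read(path)                 # Vector{UInt8} holding every byte in the file
joined = String(copy(bytes))       # decode a copy so `bytes` stays intact
println(joined)                    # the three strings run together, no separators
println(length(bytes) == sum(ncodeunits, a))   # only the string bytes were stored
rm(path)
```

Because the boundaries are gone, any reader needs extra information in the file, such as a length stored in front of each string.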


Solution

  • Bogumil is right that this is a bit hacky, but if you are keen to write to and read from binary files, here is an implementation for Vector{String}. It converts each String to a Vector{UInt8} and writes each Vector{UInt8} to the file, preceded by an Int64 that stores its length. The file itself starts with an extra Int64 that stores the length of the Vector{String}. The read routines then use this information to pull everything back in and convert it to Vector{String}:

    # A byte vector is stored as an Int64 length followed by the raw bytes.
    function my_write(fid1::IOStream, x::Vector{UInt8})
        write(fid1, Int64(length(x)))
        write(fid1, x)
    end
    # A vector of byte vectors is stored as an Int64 count followed by each element.
    function my_write(fid1::IOStream, x::Vector{Vector{UInt8}})
        write(fid1, Int64(length(x)))
        foreach(y -> my_write(fid1, y), x)  # avoids building a throwaway array
    end
    # Read the Int64 length, then read that many bytes in a single call rather than one byte at a time.
    my_read(fid1::IOStream, ::Type{Vector{UInt8}})::Vector{UInt8} = read(fid1, read(fid1, Int64))
    my_read(fid1::IOStream, ::Type{Vector{Vector{UInt8}}})::Vector{Vector{UInt8}} = [ my_read(fid1, Vector{UInt8}) for _ in 1:read(fid1, Int64) ]
    my_write(myfilepath::String, x::Vector{String}) = open(fid1 -> my_write(fid1, [ Vector{UInt8}(codeunits(y)) for y in x ]), myfilepath, "w")
    function my_read(myfilepath::String, ::Type{Vector{String}})::Vector{String}
        x = open(fid1 -> my_read(fid1, Vector{Vector{UInt8}}), myfilepath, "r")
        return [ String(y) for y in x ]
    end
    

    I've probably included a little more type information than is necessary, but it might make things a bit more obvious to you. Also, sorry, I have a bad habit of doing this sort of thing with one-liners, but you can easily unpack it if necessary. Here's some test code (just adjust the filepath):

    myfilepath = "/home/colin/Temp/test_file.bin"
    x = ["abc", "de", "f", "", "ghij"]
    my_write(myfilepath, x)
    my_read(myfilepath, Vector{String})
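
    To make the file layout concrete, here is a minimal sketch of the same length-prefixed format written by hand to an in-memory IOBuffer instead of a file. It is only an illustration of the layout described above, not a replacement for the functions; the names `io`, `nbytes`, `n` and `y` are introduced for this example.

    ```julia
    x = ["abc", "de", "f", "", "ghij"]

    io = IOBuffer()
    write(io, Int64(length(x)))            # header: number of strings
    for s in x
        write(io, Int64(ncodeunits(s)))    # per-string length prefix, in bytes
        write(io, s)                       # the string's raw bytes
    end
    nbytes = position(io)                  # 8 + 5*8 + 10 = 58 bytes in total

    # Read it back: the count first, then (length, bytes) pairs.
    seekstart(io)
    n = read(io, Int64)
    y = [String(read(io, read(io, Int64))) for _ in 1:n]
    println(y == x)
    ```

    Every string costs 8 extra bytes for its length prefix, which is worth keeping in mind for an array with hundreds of millions of short strings.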
    

    Note that, with a little effort, this code can be generalized to work for pretty much any Vector{Vector{T}}, as long as T is writable. In fact, if you're really clever, it should generalize to any Vector{Vector{Vector{...{T}}}}, as long as you can get the recursion right.
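
As a rough sketch of that generalization, the one-level case Vector{Vector{T}} might look like the following. The helper names `write_nested` and `read_nested` are hypothetical, and the sketch assumes T is a plain bits type (such as Float64 or Int32) so that whole vectors can be written and read in one call. The bytes are stored in the machine's native layout, so the file is not guaranteed to be portable across architectures.

```julia
# Hypothetical helpers: length-prefixed storage for Vector{Vector{T}}, T a bits type.
function write_nested(io::IO, x::Vector{Vector{T}}) where {T}
    isbitstype(T) || throw(ArgumentError("element type must be a bits type"))
    write(io, Int64(length(x)))        # number of inner vectors
    for v in x
        write(io, Int64(length(v)))    # element count of this inner vector
        write(io, v)                   # raw elements in one call
    end
    return nothing
end

function read_nested(io::IO, ::Type{T}) where {T}
    isbitstype(T) || throw(ArgumentError("element type must be a bits type"))
    n = read(io, Int64)
    out = Vector{Vector{T}}(undef, n)
    for i in 1:n
        v = Vector{T}(undef, read(io, Int64))
        read!(io, v)                   # fill the preallocated vector in one call
        out[i] = v
    end
    return out
end

# Round trip through an in-memory buffer.
x = [[1.5, 2.5], Float64[], [3.0]]
io = IOBuffer()
write_nested(io, x)
seekstart(io)
y = read_nested(io, Float64)
println(y == x)
```

Deeper nesting would follow the same pattern, with the inner `write` and `read!` calls replaced by recursive calls to the helpers.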