How to feed a dictionary to a Flux model in Julia


I have a 20000x4 dataset in which all four columns contain strings. The first column is a description and the other three are categories, the last being the one I want to predict. I tokenized every word in the first column and stored each word in a dictionary with its corresponding Int value, and I converted the other columns to numerical values. Now I'm having trouble understanding how to feed this data into a Flux model.

According to the documentation, I have to use a "collection of data to train with (usually a set of inputs x and target outputs y)". The example there separates the data into x and y. But how can I do that with a dictionary plus two numeric columns?

Edit:

Here is a minimal example of what I have right now:

using WordTokenizers
using DataFrames

dataframe = DataFrame(Description = ["It has pointy ears", "It has round ears"], Size = ["Big", "Small"], Color = ["Black", "Yellow"], Category = ["Dog", "Cat"])

dict_x = Dict{String, Int64}()
dict_y = Dict{String, Int64}()

# Assign each new word in the given column the next integer index.
function words_to_numbers(data, column, dict)
    i = 1
    for row in 1:size(data, 1)
        for word in tokenize(data[row, column])
            if !haskey(dict, word)
                dict[word] = i
                i += 1
            end
        end
    end
end

# Assign each new category in the given column the next integer index.
function categories_to_numbers(data, column, dict)
    i = 1
    for row in 1:size(data, 1)
        category = data[row, column]
        if !haskey(dict, category)
            dict[category] = i
            i += 1
        end
    end
end

words_to_numbers(dataframe, 1, dict_x)
categories_to_numbers(dataframe, 4, dict_y)

I want to use dict_x and dict_y as the input and output for a Flux model.


Solution

  • Consider this example:

    using DataFrames
    
    df = DataFrame()
    df.food = rand(["apple", "banana", "orange"], 20)
    
    # Synthetic calorie value: a base value per fruit plus up to 10% noise.
    multiplier(fruit) = (1 + 0.1 * rand()) * (fruit == "apple" ? 95 :
        fruit == "orange" ? 45 : 105)
    # Integer token for each fruit name.
    foodtoken(fruit) = (fruit == "apple" ? 0 : fruit == "orange" ? 2 : 3)
    
    df.calories = multiplier.(df.food)
    
    # Lookup table from fruit name to its token.
    fooddict = Dict(fruit => foodtoken(fruit) for fruit in unique(df.food))
    

    Now we can add the token numeric values to the dataframe:

    df.token = map(x -> fooddict[x], df.food)
    
    println(df)
    

    Now you should be able to run the prediction with df.token as an input and df.calories as an output.
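
    Flux layers such as Dense expect the features along the first dimension and one observation per column, so the two columns would typically be reshaped into 1×N Float32 matrices first. A minimal sketch of that step, using plain vectors as stand-ins for df.token and df.calories (the names tokens, calories, x, y and data, and the sample values, are made up for illustration):

    ```julia
    # Stand-ins for df.token and df.calories; the values are illustrative only.
    tokens   = [0, 3, 2, 0, 3]
    calories = [97.1, 110.4, 46.2, 99.8, 106.3]

    # One row per feature, one column per observation.
    x = reshape(Float32.(tokens), 1, :)
    y = reshape(Float32.(calories), 1, :)

    # A collection holding a single (input, target) batch.
    data = [(x, y)]
    ```

    A collection of (x, y) pairs like this is the kind of "inputs x and target outputs y" data the question quotes from the documentation.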

    ========== addendum after you posted further code: ===========

    With your modified example, you just need a helper function:

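    # Encode the string s as one integer: every dictionary key k that occurs
    # in s as a substring contributes 10^v, where v is that key's index.
    function colvalue(s, dict)
        total = 0
        for (k, v) in dict
            if occursin(k, s)
                total += 10^v
            end
        end
        total
    end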
    
    
    words_to_numbers(dataframe, 1, dict_x)
    categories_to_numbers(dataframe, 4, dict_y)
    
    dataframe.descripval = map(x -> colvalue(x, dict_x), dataframe.Description)
    dataframe.catval = map(x -> colvalue(x, dict_y), dataframe.Category)
    
    println(dataframe)
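
    One caveat with the 10^v encoding: in Julia, 10^v silently wraps around for Int64 once v exceeds 18, so it only suits very small vocabularies. A possible alternative, sketched here under assumptions rather than taken from the original answer, is a multi-hot (bag-of-words) matrix for the descriptions and a one-hot matrix for the categories. In this sketch split stands in for tokenize so that it runs without extra packages, and build_index, words, X and Y are hypothetical names:

    ```julia
    # Toy data mirroring the question's example.
    descriptions = ["It has pointy ears", "It has round ears"]
    categories   = ["Dog", "Cat"]

    # Assign each distinct item the next free integer index.
    function build_index(items)
        index = Dict{String, Int}()
        for item in items
            get!(index, item, length(index) + 1)
        end
        return index
    end

    # split on whitespace is a stand-in for WordTokenizers.tokenize.
    words  = [String.(split(d)) for d in descriptions]
    dict_x = build_index(Iterators.flatten(words))
    dict_y = build_index(categories)

    # X[i, j] == 1 when word i occurs in description j.
    X = zeros(Float32, length(dict_x), length(descriptions))
    for (j, ws) in enumerate(words), w in ws
        X[dict_x[w], j] = 1
    end

    # Y[c, j] == 1 when description j belongs to category c.
    Y = zeros(Float32, length(dict_y), length(categories))
    for (j, c) in enumerate(categories)
        Y[dict_y[c], j] = 1
    end
    ```

    Each column of X is then one description and the matching column of Y is its category, so the pair (X, Y) has the features-by-observations layout that Flux layers work with.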