Categorical data with Deedle


Suppose I have CSV data with a categorical variable in it, like

     Entry Color 
0 -> 1     Red   
1 -> 2     Blue  

I would like to convert this variable into a discriminated union. I tried row.GetAs<Color>, but that throws an InvalidCastException. If I use fromString/toString instead, I have to keep track of which values have already been converted (read from records) and which have not (read from CSV data). Is there a better solution?


    #r "nuget: Deedle"
    
    open Deedle
    
    //https://stackoverflow.com/questions/21559497/create-discriminated-union-case-from-string
    module Util =
        open Microsoft.FSharp.Reflection
    
        let toString (x:'a) = 
            let (case, _ ) = FSharpValue.GetUnionFields(x, typeof<'a>)
            case.Name
    
        let fromString<'a> (s:string) =
            match FSharpType.GetUnionCases typeof<'a> |> Array.filter (fun case -> case.Name = s) with
            |[|case|] -> (FSharpValue.MakeUnion(case,[||]) :?> 'a)
            |_ -> failwith $"Unknown union case {s}"
    
    type Color =
        | Red
        | Blue
        | Green
        override this.ToString() =  Util.toString this
        static member fromString s = Util.fromString<Color> s
    
    
    let data = "Entry;Color\n1;Red\n2;Blue"
    
    //https://stackoverflow.com/questions/44344061/converting-a-string-to-a-stream/44344794
    let bytes = System.Text.Encoding.UTF8.GetBytes data
    let stream = new System.IO.MemoryStream(bytes)
    
    let df:Frame<int,string> = Frame.ReadCsv(
        stream = stream,
        separators = ";",
        hasHeaders = true
    )
    
    df.Print()
    
    //let col = df |> Frame.mapRowValues (fun row -> row.GetAs<Color>"Color") 
    //Invalid cast from 'System.String' to 'FSI_...+Color'.
    
    let col' = df |> Frame.mapRowValues (fun row -> Color.fromString (row.GetAs<string> "Color"))
    //works 
    
    df.ReplaceColumn("Color", col')
    
    df.SaveCsv(__SOURCE_DIRECTORY__ + "/df.csv",includeRowKeys=false)
    
    let df' = Frame.ReadCsv(__SOURCE_DIRECTORY__ + "/df.csv", schema="int,Color")
    
    df |> Frame.mapRowValues (fun row -> row.GetAs<Color> "Color") 
    //works
    
    df' |> Frame.mapRowValues (fun row -> row.GetAs<Color> "Color") 
    //breaks


Solution

  • Unfortunately, there is no way to tell Deedle to convert particular columns to a discriminated union when reading CSV data. (This would not really work with unions that have cases with arguments and Deedle also does not know what types are defined in your F# code.)

    The best way is something along the lines of what you are currently doing: read the CSV file with the categorical values as strings, then parse them manually and replace the column. I would probably do this by getting the column as a series and using Series.mapValues to transform the data (that is a bit more direct than using Frame.mapRowValues):

    let df = Frame.ReadCsv(stream = stream, separators = ";", hasHeaders = true)
    let newCol = df.Columns.["Color"].As<string>() |> Series.mapValues Color.fromString
    df.ReplaceColumn("Color", newCol)
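
    If several columns need this treatment, the read-parse-replace steps can be wrapped once so the conversion happens right after loading and the rest of the code only ever sees union values. The following is a minimal sketch under that assumption; `unionParser` and `parseUnionColumn` are hypothetical helper names, not part of Deedle, and only union cases without fields are supported (cases with arguments cannot be built from a bare name).

```fsharp
#r "nuget: Deedle"

open System.IO
open System.Text
open Deedle
open Microsoft.FSharp.Reflection

type Color =
    | Red
    | Blue
    | Green

// Hypothetical helper, not part of Deedle: build the name -> value table once,
// so reflection does not run again for every row.
let unionParser<'T> () : string -> 'T =
    let table =
        FSharpType.GetUnionCases typeof<'T>
        // Only cases without fields can be created from a bare name.
        |> Array.filter (fun case -> case.GetFields().Length = 0)
        |> Array.map (fun case -> case.Name, (FSharpValue.MakeUnion(case, [||]) :?> 'T))
        |> dict
    fun name ->
        match table.TryGetValue(name) with
        | true, value -> value
        | _ -> failwithf "Unknown union case '%s'" name

// Hypothetical helper: replace a string column with its parsed union values, in place.
let parseUnionColumn<'T> (column: string) (df: Frame<int, string>) =
    let parsed = df.GetColumn<string>(column) |> Series.mapValues (unionParser<'T> ())
    df.ReplaceColumn(column, parsed)

// Same sample data as in the question.
let data = "Entry;Color\n1;Red\n2;Blue"
let stream = new MemoryStream(Encoding.UTF8.GetBytes(data))
let df : Frame<int, string> =
    Frame.ReadCsv(stream = stream, separators = ";", hasHeaders = true)

// Convert immediately after reading.
parseUnionColumn<Color> "Color" df

// The column now holds Color values, so typed access works as in the question.
let colors = df |> Frame.mapRowValues (fun row -> row.GetAs<Color> "Color")
```

    Building the lookup table up front also means an unknown category fails with a clear message on the first bad value rather than with an InvalidCastException later. A frame written with SaveCsv and read back still comes back with a string column, so the same helper has to be applied again after each ReadCsv.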