Search code examples
f#deedle

Return multiple columns / a dataframe in Deedle based on row-wise mapping


I want to look at each row in a frame and construct multiple columns for a new frame based on values in that row.

The final result should be a frame that has the columns of the original frame plus the new columns.

I have a solution but I wonder if there is a better one. I think the best way to explain the desired behavior is with an example. I'm using Deedle's titanic data set:

#r @"F:\aolney\research_projects\braintrust\code\QualtricsToR\packages\Deedle.1.2.3\lib\net40\Deedle.dll";;
#r @"F:\aolney\research_projects\braintrust\code\QualtricsToR\packages\FSharp.Charting.0.90.12\lib\net40\FSharp.Charting.dll";;
#r @"F:\aolney\research_projects\braintrust\code\QualtricsToR\packages\FSharp.Data.2.2.2\lib\net40\FSharp.Data.dll";;
open System
open FSharp.Data
open Deedle
open FSharp.Charting;;
#load @"F:\aolney\research_projects\braintrust\code\QualtricsToR\packages\FSharp.Charting.0.90.12\FSharp.Charting.fsx";;
#load @"F:\aolney\research_projects\braintrust\code\QualtricsToR\packages\Deedle.1.2.3\Deedle.fsx";;

let titanic = Frame.ReadCsv(@"C:\Users\aolne_000\Downloads\titanic.csv");;

This is what that frame looks like:

val titanic : Frame<int,string> =

       PassengerId Survived Pclass Name                                                Sex    Age       SibSp Parch Ticket           Fare    Cabin Embarked 
0   -> 1           False    3      Braund, Mr. Owen Harris                             male   22        1     0     A/5 21171        7.25          S        
1   -> 2           True     1      Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38        1     0     PC 17599         71.2833 C85   C        

My approach grabs each row, uses some selection logic, and then returns a new row value as a dictionary. Then I use Deedle's expansion operation to convert the values in this dictionary to new columns.

titanic?test <- titanic |> Frame.mapRowValues( fun x -> if x.GetAs<int>("Pclass") > 1 then dict ["A", 1; "B", 2] else dict ["A", 2 ; "B", 1] );;
titanic |> Frame.expandCols ["test"];;

This gives the following new frame:

       PassengerId Survived Pclass Name                                                Sex    Age       SibSp Parch Ticket           Fare    Cabin Embarked test.A test.B 
0   -> 1           False    3      Braund, Mr. Owen Harris                             male   22        1     0     A/5 21171        7.25          S        1      2      
1   -> 2           True     1      Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38        1     0     PC 17599         71.2833 C85   C        2      1      

Note the last two columns are test.A and test.B. Effectively this approach creates a new frame (A and B) and then joins the frame to the existing frame.

This is fine for my use case but it is probably confusing for others to read. Also it forces the prefix, e.g. "test", on the final columns which isn't highly desirable.

Is there a way to append the new values to the end of the row series represented in the code above by x?


Solution

  • I find your approach quite elegant and clever. Because the new series shares the index with the original frame, it is also going to be pretty fast. So, I think your solution may actually be better than the alternative option (but I have not measured this).

    Anyway, the other option would be to return new rows from your Frame.mapRowValues call - so for each row, we return the original row together with the additional columns.

    titanic 
    |> Frame.mapRowValues(fun x -> 
      let add =  
        if x.GetAs<int>("Pclass") > 1 then series ["A", box 1; "B", box 2] 
        else series ["A", box 2 ; "B", box 1]
      Series.merge x add)
    |> Frame.ofRows