Search code examples
dataframef#deedle

Deedle F# - Find the max rows within an index group


I am trying to get the difference between the first and last values within groups in a deedle data frame. I am stuck at trying to get the max and min values for each point. I have a dataframe sorted by "TimeStamp", grouped by "Test" and with a column called "Value" of type float. For example with the following data:

Test1,2018/01/01 => 3
Test1,2018/01/02 => 5
Test1,2018/01/03 => 15
Test1,2018/01/04 => 25
Test1,2018/01/05 => 35
Test2,2018/01/01 => 5
Test2,2018/01/02 => 15
Test2,2018/01/03 => 15
Test2,2018/01/04 => 25
Test2,2018/01/05 => 30

I expect a result of

Test1 => 32    (35-3)
Test2 => 25    (30-5)

I assume I can use something like

input |> applyLevel fst Stats.maxBy |> Frame.ofRows
input |> applyLevel fst Stats.minBy |> Frame.ofRows

but I can not seem to get it to work no matter the combinations.

Thank you in advance


Solution

  • To test this, I saved your sample data in a CSV file in the following format:

    Category, Timestamp, Value
    Test1, 2018/01/01, 3
    Test1, 2018/01/02, 5
    Test1, 2018/01/03, 15
    Test1, 2018/01/04, 25
    Test1, 2018/01/05, 35
    Test2, 2018/01/01, 5
    Test2, 2018/01/02, 15
    Test2, 2018/01/03, 15
    Test2, 2018/01/04, 25
    Test2, 2018/01/05, 30
    

    Now, I think the issue with using Stats.max and Stats.min is that they return option which is None when the series is empty - this sadly makes it impossible to pass them directly to applyLevel (I admit this is a bit inconvenient design)

    However, you can pass a custom function to applyLevel, which also allows you to do both min and max at the same time:

    Frame.ReadCsv("sample.csv")
    |> Frame.groupRowsByString "Category"
    |> Frame.applyLevel fst (fun s -> 
        match Stats.min s, Stats.max s with
        | Some lo, Some hi -> hi - lo
        | _ -> nan)
    

    This reads the CSV file (which will introduce a new ordinal key), then it groups the data by category (which gets you a data frame with row key string * int) and then we aggregate all rows with the same string key using applyLevel fst. The custom function gets minimum and maximum of the value series and subtracts it (the case with nan would only happen if you had some missing values in s).