Search code examples
f#deedle

How to aggregate non-float values in Deedle


Frame.aggregateRowsBy is used to perform the aggregation.
In this case, the third argument, aggFunc, will be an aggregate function such as Stats.max.
However, most of the functions defined in the Stats module seem to support only float.
How can I aggregate the maximum and minimum values as DateTime or decimal?

The following is sample code that results in a runtime error.

Frame([ "id"; "date"; "quantity"], [Series.ofValues [1; 1; 2]; Series.ofValues [ DateTime(2023, 4, 13); DateTime(2023, 4, 10); DateTime(2023, 4, 16) ]; Series.ofValues [ 2.7M; 1.2M; 1.5M; ]])
|> Frame.aggregateRowsBy ["id"] ["date"; "quantity"] Stats.max
Error: System.InvalidCastException: Invalid cast from 'DateTime' to 'Double'.

There are two things to accomplish:

  1. Aggregate non-float values (e.g. DateTime, IComparable)
  2. The post-aggregation Frame should respect the value type of the pre-aggregation. (The result of the aggregation must not be a float.)

Solution

  • One limitation of the aggregateRowsBy function is that you can only use this to aggregate columns where the values can be converted to one shared type. You can always make that obj, but then it will be very hard to do anything useful with the values. In your case, you can actually use IComparable, which will let you pass the values of the series to Seq.max:

    df 
    |> Frame.aggregateRowsBy ["id"] ["date"; "quantity"] 
        (fun (ds:Series<_, IComparable>) -> ds.Values |> Seq.max)
    

    In more complex cases, this probably won't work. Then it probably will be easier to run aggregateRowsBy multiple times (on different types of columns) - something like:

    let dq = df |> Frame.aggregateRowsBy ["id"] ["quantity"] Stats.max
    let dd = df |> Frame.aggregateRowsBy ["id"] ["date"] (fun (ds:Series<_, DateTime>) -> Seq.max ds.Values)
    dq.AddColumn("date", dd.Columns.["date"])
    dq.Print()
    

    For reference, the obj-based version would be:

    df |> Frame.aggregateRowsBy ["id"] ["date"; "quantity"] (fun (ds:Series<_, obj>) -> 
      let asFloat = ObjectSeries(ds).TryAs<float>() 
      if asFloat.HasValue then box (Stats.max asFloat.Value) else
      let asDt = ObjectSeries(ds).TryAs<DateTime>() 
      if asDt.HasValue then box (Seq.max (asDt.Value.Values))
      else failwith "Wrong column type!")