Search code examples
f#deedle

Suggestion for fast performance expanding apply with deedle


The Stats.expandingXXXX functions are pretty fast. However there is no public api to do a expandingWindow apply. The following solution i created is really slow when it comes to large dataset like 100k. Any suggestion is appreciated?

 let ExpWindowApply f minSize data = 
        let keys = dataSeries.Keys
        let startKey = dataSeries.FirstKey()
        let values = keys
                     |> Seq.map(fun k -> 
                                    let ds = data.Between(startKey,k) 
                                    match ds with 
                                    |_ when ds.ValueCount >= minSize -> f ds.Values
                                    |_ -> Double.NaN
                                )                     
        let result = Series(keys, values)
        result

I understand the Stats.expandingXXX function are actually special cases where the function being applied can be iterately calculated based on previous loop's state. And not all function can take advantage of states from previous calculation. Is there anything better way than Series.Between in terms of creating a window of data?

Update

For those who are also interested in the similar issue. The answer provides alternative implementation and insight into rarely documented series vector and index operation. But it doesn't improve performance.


Solution

  • The expanding functions in Deedle are fast because they are using an efficient online algorithm that makes it possible to calculate the statistics on the fly with just one pass - rather than actually building the intermediate series for the sub-ranges.

    There is a built-in function aggregate that lets you do something this - though it works in the reversed way. For example, if you want to sum all elements starting from the current one to the end, you can write:

    let s = series [ for i in 1 .. 10 -> i, float i ]
    
    s |> Series.aggregateInto
            (Aggregation.WindowWhile(fun _ _ -> true))
            (fun seg -> seg.Data.FirstKey()) 
            (fun seg -> OptionalValue(Stats.sum seg.Data))
    

    If you want to do the same thing using the underlying representation, you can directly use the addressing scheme that Deedle uses to link the keys (in the index) with values (in the data vector). This is an ugly mutable sample, but you can encapsulate it into something nicer:

    [ let firstAddr = s.Index.Locate(s.FirstKey())
      for k in s.Index.KeySequence ->
        let lastAddr = s.Index.Locate(k)
        seq { 
          let a = ref firstAddr
          while !a <> lastAddr do
            yield s.Vector.GetValue(!a).Value
            a := s.Index.AddressOperations.AdjustBy(!a, +1L) } |> Seq.sum ]