Search code examples
f#deedle

Data processing using F# and Deedle


Can someone please help in translating the following F# pseudo-code into working code using Deedle?

open Deedle

..
//df = Frame<DateTime, string>(...), int values, keys ordered

let mutable leftKey = leftmostKey
seq {
  for row in df.Rows do
    let columnKeyMax = row[leftKey..] |> findColumnKeyContainingMaxValue
    if columnKeyMax <> leftKey then 
      leftKey <- columnKeyMax
      yield (row.Key, columnKeyMax)
}

Given the following dataframe, I am looking to get the coordinates of the marked cells:

["1/11/2010 22:00", "GCA2010Dec"; "29/11/2010 23:00", "GCA2011Feb";...]

let mutable currentColumnKey, currentMax = find max in first row //"GCA2010Dec, 113517"
for each row after first
  newColumnKey, newMax = try find the first column on the right of currentColumnKey with value > row[currentColumnKey]
  if found then yield (row.Key, newColumnKey)
  currentColumnKey <- newColumnKey
  currentMax <- row[newColumnKey]

columns keys are future contracts row keys are trading sessions, content is daily volume. The algorithm aims to find roll pairs (date, contract). At the beginning there is no current contract so the maximum volume in the first row is taken => first roll is ("1/11/2010 22:00", "GCA2010Dec"). Then rows are parsed until a contract is found right of the current and with greater volume. 29/11/2010 is the first date where on the right of current contract (GCA2010Dec) there is another contract (GCA2011Feb) with a greater volume (110942 > 27451). This generates a new roll pair ("29/11/2010 23:00", "GCA2011Feb"). So on until all rows are parsed

enter image description here


Solution

  • I think there is not much clever you can do here with Deedle. It ends up being an iteration over the data in both dimensions (and it may be easier to do this using a plain 2D array). However, here is a Deedle-based solution.

    First, I loaded the data, indexed it by keys and parsed into int so that Deedle handles <missing> correctly for me:

    let df = 
      Frame.ReadCsv(@"C:\Users\tomas\Downloads\df.csv") 
      |> Frame.indexRowsDate "Key"
      |> Frame.mapValues (fun (o:string) -> 
        if String.IsNullOrEmpty o then Nullable() else Nullable(Int32.Parse o ))
    

    The key function is then a recursive one (to implement iteration over rows). One clever thing you can do is to notice that the column keys are also ordered and so you can easily get a sub-frame with only columns on the right using subdf.Columns.[ck ..]:

    let rec loop ck (rowKeys:DateTime list) (subdf:Frame<_, _>)= 
      match rowKeys with 
      | [] -> ()
      | rk::rowKeys ->
          // Get all the non-missing values in the current row
          let obs = subdf.Rows.[rk].As<int>().Observations
          let (KeyValue(_, current)) = obs |> Seq.head
          printfn "Current date '%A', column '%s' has: %A" rk ck current
    
          // See if there is a greater one in later column
          let found = obs |> Seq.tail |> Seq.tryFind (fun (KeyValue(_, v)) -> v > current)
    
          // If no, stay on the current column. If yes, take a sub-frame
          // and continue scanning that recursively
          match found with 
          | None -> loop ck rowKeys subdf
          | Some(KeyValue(ck, v)) ->
              printfn "---> Found new CK: %s with value %A greater than %A" ck v current
              loop ck rowKeys subdf.Columns.[ck ..]
    

    To start looking at the range from your screenshot:

    let df2010 = df.Rows.[DateTime(2010,11,1) .. DateTime(2011,12,3)]
    let rowKeys = df2010.RowKeys |> List.ofSeq
    let ck = "GCA201012"
    loop ck rowKeys df2010.Columns.[ck ..]