Search code examples
f#deedle

Deedle - Distinct by column


I had a situation the other day where a particular column of my Frame had some duplicate values.

I wanted to remove any rows where said column had a duplicate value.

I managed to hack a solution using a filter function, and while it was good enough for the exploratory data analysis at hand, it was way more painful that it should have been.

Despite searching high and low, I could not find any ideas on an elegant solution.

I also notices that Series don't offer a DistincyBy() or similar either.

How to do you do a "DistinctBy" operation for a specific column/s ?


Solution

  • One way to do it is using nest and unnest, something like this:

    let noDuplicates: Frame<(int*string), string> =
      df1
      |> Frame.groupRowsBy "Tomas"
      |> Frame.nest
      |> Series.mapValues (Frame.take 1)
      |> Frame.unnest
    

    Let's explain each step. Imagine you have this dataframe:

    // Create from individual observations (row * column * value)
    let df1 = 
      [ ("Monday", "Tomas", 1); ("Tuesday", "Adam", 2)
        ("Tuesday", "Tomas", 4); ("Wednesday", "Tomas", -5)
        ("Thursday", "Tomas", 4); ("Thursday", "Adam", 5) ]
      |> Frame.ofValues
    
                Tomas Adam      
    Monday    -> 1     <missing> 
    Tuesday   -> 4     2         
    Wednesday -> -5    <missing> 
    Thursday  -> 4     5     
    

    And you want to remove rows containing duplicate values in the "Tomas" column.

    First, group by this column.

    let df2 : Frame<(int * string), string> = df1 |> Frame.groupRowsBy "Tomas"
    
                    Tomas Adam      
    1  Monday    -> 1     <missing> 
    4  Tuesday   -> 4     2         
    4  Thursday  -> 4     5         
    -5 Wednesday -> -5    <missing> 
    

    Now you have a frame with a two-level index, which you can turn into a series of data frames.

    let df3 = df2 |> Frame.nest
    
              Tomas Adam      
    Monday -> 1     <missing> 
    
                Tomas Adam 
    Tuesday  -> 4     2    
    Thursday -> 4     5    
    
                 Tomas Adam      
    Wednesday -> -5    <missing> 
    

    Take the first row of each frame.

    let df4 = df3 |> Series.mapValues (fun fr -> fr |> Frame.take 1)
    
              Tomas Adam      
    Monday -> 1     <missing> 
    
               Tomas Adam 
    Tuesday -> 4     2    
    
                 Tomas Adam      
    Wednesday -> -5    <missing> 
    

    It remains to perform the backwards conversion: from a series of data frames into a frame with a two-level index.

    let df5 = df4 |> Frame.unnest
    
                    Tomas Adam      
    -5 Wednesday -> -5    <missing> 
    1  Monday    -> 1     <missing> 
    4  Tuesday   -> 4     2