I had a situation the other day where a particular column of my Frame
had some duplicate values.
I wanted to remove any rows where said column had a duplicate value.
I managed to hack a solution using a filter function, and while it was good enough for the exploratory data analysis at hand, it was way more painful that it should have been.
Despite searching high and low, I could not find any ideas on an elegant solution.
I also notices that Series
don't offer a DistincyBy()
or similar either.
How to do you do a "DistinctBy" operation for a specific column/s ?
One way to do it is using nest
and unnest
, something like this:
let noDuplicates: Frame<(int*string), string> =
df1
|> Frame.groupRowsBy "Tomas"
|> Frame.nest
|> Series.mapValues (Frame.take 1)
|> Frame.unnest
Let's explain each step. Imagine you have this dataframe:
// Create from individual observations (row * column * value)
let df1 =
[ ("Monday", "Tomas", 1); ("Tuesday", "Adam", 2)
("Tuesday", "Tomas", 4); ("Wednesday", "Tomas", -5)
("Thursday", "Tomas", 4); ("Thursday", "Adam", 5) ]
|> Frame.ofValues
Tomas Adam
Monday -> 1 <missing>
Tuesday -> 4 2
Wednesday -> -5 <missing>
Thursday -> 4 5
And you want to remove rows containing duplicate values in the "Tomas" column.
First, group by this column.
let df2 : Frame<(int * string), string> = df1 |> Frame.groupRowsBy "Tomas"
Tomas Adam
1 Monday -> 1 <missing>
4 Tuesday -> 4 2
4 Thursday -> 4 5
-5 Wednesday -> -5 <missing>
Now you have a frame with a two-level index, which you can turn into a series of data frames.
let df3 = df2 |> Frame.nest
Tomas Adam
Monday -> 1 <missing>
Tomas Adam
Tuesday -> 4 2
Thursday -> 4 5
Tomas Adam
Wednesday -> -5 <missing>
Take the first row of each frame.
let df4 = df3 |> Series.mapValues (fun fr -> fr |> Frame.take 1)
Tomas Adam
Monday -> 1 <missing>
Tomas Adam
Tuesday -> 4 2
Tomas Adam
Wednesday -> -5 <missing>
It remains to perform the backwards conversion: from a series of data frames into a frame with a two-level index.
let df5 = df4 |> Frame.unnest
Tomas Adam
-5 Wednesday -> -5 <missing>
1 Monday -> 1 <missing>
4 Tuesday -> 4 2