f# deedle

Drop duplicates except for the first occurrence with Deedle


I have a table in which one key column contains duplicate values. I would like to drop all rows with a duplicate key, keeping only the first row for each key.

open System.IO
open Deedle

let data = "A;B\na;1\nb;\nb;2\nc;3"

let bytes = System.Text.Encoding.UTF8.GetBytes data
let stream = new MemoryStream(bytes)

let df =
    Frame.ReadCsv(
        stream = stream,
        separators = ";",
        hasHeaders = true
    )

df.Print()
     A B         
0 -> a 1         
1 -> b <missing> 
2 -> b 2         
3 -> c 3              

The result should be

     A B         
0 -> a 1         
1 -> b <missing>       
2 -> c 3       
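
The intended "keep the first row per key" semantics can be sketched without Deedle, using plain F# lists. This is only an illustration of the desired result, not a Deedle solution: the `rows` list is a hand-written stand-in for the CSV above, and a missing `B` value is modelled as `None`.

```fsharp
// Hand-written stand-in for the CSV rows: (A, B), where a missing B is None
let rows =
    [ ("a", Some 1)
      ("b", None)
      ("b", Some 2)
      ("c", Some 3) ]

// List.distinctBy keeps the first element for each key and discards later ones
let firstPerKey = rows |> List.distinctBy fst

// Print with fresh 0-based row numbers, mirroring the desired table
firstPerKey
|> List.iteri (fun i (a, b) ->
    let bText =
        match b with
        | Some v -> string v
        | None -> "<missing>"
    printfn "%d -> %s %s" i a bText)
```

Note that the row for `b` keeps its missing value, because the first row wins even when it is incomplete.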

I have tried applyLevel, but it gives me the first available value of each column rather than the first row of each group:

let df1 =
    df
    |> Frame.groupRowsByString "A"
    |> Frame.applyLevel fst (fun s -> s |> Series.firstValue)

df1.Print()
     A B 
a -> a 1 
b -> b 2 <- wrong
c -> c 3 

Solution

  • This is essentially a duplicate of a previous Stack Overflow question. The short answer is:

    let df1 =
        df
        |> Frame.groupRowsByString "A"
        |> Frame.nest                        // convert to a series of frames
        |> Series.mapValues (Frame.take 1)   // take the first row from each frame
        |> Frame.unnest                      // convert back to a single frame
        |> Frame.mapRowKeys snd              // keep only the original row index as the key
    df1.Print()
    

    The output is:

         A B
    0 -> a 1
    1 -> b <missing>
    3 -> c 3
    

    I've added a call to Frame.mapRowKeys at the end to match your desired output as closely as possible. The actual output still differs slightly from what you expected, because the row c 3 keeps its original index 3 instead of being renumbered to 2. I think this is more correct, but you can renumber the rows if necessary.

    The referenced question has more details.
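
If the renumbered keys 0, 1, 2 from the question are required, one option is to append `Frame.indexRowsOrdinally` to the pipeline. The following is a minimal sketch, assuming it is run as an F# script with the Deedle NuGet package available; the name `renumbered` is introduced here for illustration only.

```fsharp
#r "nuget: Deedle"
open System.IO
open Deedle

// Same sample data as in the question
let data = "A;B\na;1\nb;\nb;2\nc;3"
let stream = new MemoryStream(System.Text.Encoding.UTF8.GetBytes data)
let df = Frame.ReadCsv(stream = stream, separators = ";", hasHeaders = true)

let renumbered =
    df
    |> Frame.groupRowsByString "A"
    |> Frame.nest
    |> Series.mapValues (Frame.take 1)   // first row of each group
    |> Frame.unnest
    |> Frame.mapRowKeys snd              // back to the original integer keys
    |> Frame.indexRowsOrdinally          // replace the keys with 0, 1, 2, ...

renumbered.Print()
```

The trade-off is that the original row positions are lost, so it is no longer possible to tell which source row each surviving row came from.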