Search code examples
sasdeedledense-rank

How to perform "ranking" (as in SAS or a "dense rank" as in SQL) in Deedle


Sas has a procedure called rank that assigns a "rank" to each row in a dataframe according to the position in an ordered set of a variable, kind of; but the rank is not just the position: one has to tell the procedure how many groups use in the ranking. The rank is actually the group to which the row belongs.

In SQL terms, this is called a dense ranking.

Example (the salary variable is included for generality, but it is not used in this example):

Say we have this data frame:

Original

If we rank by age using 4 groups, sas would give us this: ranked

It is easier to understand what happened if we sort the data by the variable we ranked: ranked and sorted

Now we can see why rank gives us the position in an ordered set, kind of.

The rank procedure is very useful and cool, but I could't find in Deedle's doc how to perform it. Is there direct way to do it in Deedle or I need to create my own extension?

I suppose I could do it using these functions:

SortRows(frame, key)
chunk size series

Solution

  • I wrote my own extension:

    type Frame<'TRowKey, 'TColumnKey
            when 'TRowKey : equality
            and 'TColumnKey : equality> with
        static member denseRank column (groups:int) rankName frame =
            let frameLength = Frame.countRows frame |> float
            let chunkSize =  frameLength / (float groups) |> Math.Ceiling |> int64
    
            let sorted =
                frame
                |> Frame.sortRows column
    
            let ranks =
                Frame.getCol column frame
                |> Series.map(fun k _ ->
                    int ((Frame.indexForKey k sorted) / chunkSize)
                )
    
            let clone = frame.Clone()
            clone.AddColumn(rankName, ranks)
            clone
    

    where indexForKey is this other custom extension:

        // index for row with key
        // index starting at 0
        static member indexForKey (key:'K) (frame:Frame<'K,_>) : int64 =
            frame.RowIndex.Locate key
            |> frame.RowIndex.AddressOperations.OffsetOf
    

    I tried this other definition hoping that it would run faster. It is slightly faster, but not by a lot; any comments on performance issues are welcomed:

    static member denseRank column (groups:int) rankName frame =
            let frameLength = Frame.countRows frame
            let chunkSize =  (float frameLength) / (float groups) |> Math.Ceiling
    
            let sorted =
                frame
                |> Frame.sortRows column
    
            let sortedKeys = Frame.getRowKeys sorted
    
            let ranksArr = Array.zeroCreate frameLength
    
            sortedKeys
            |> Seq.iteri (fun index _ -> ranksArr.[index] <- index / (int chunkSize))
    
            let ranks = Series(sortedKeys, ranksArr)
            let clone = frame.Clone()
            clone.AddColumn(rankName, ranks)
            clone