Sas has a procedure called rank that assigns a "rank" to each row in a dataframe according to the position in an ordered set of a variable, kind of; but the rank is not just the position: one has to tell the procedure how many groups use in the ranking. The rank is actually the group to which the row belongs.
In SQL terms, this is called a dense ranking.
Example (the salary variable is included for generality, but it is not used in this example):
Say we have this data frame:
If we rank by age using 4 groups, sas would give us this:
It is easier to understand what happened if we sort the data by the variable we ranked:
Now we can see why rank gives us the position in an ordered set, kind of.
The rank procedure is very useful and cool, but I could't find in Deedle's doc how to perform it. Is there direct way to do it in Deedle or I need to create my own extension?
I suppose I could do it using these functions:
SortRows(frame, key)
chunk size series
I wrote my own extension:
type Frame<'TRowKey, 'TColumnKey
when 'TRowKey : equality
and 'TColumnKey : equality> with
static member denseRank column (groups:int) rankName frame =
let frameLength = Frame.countRows frame |> float
let chunkSize = frameLength / (float groups) |> Math.Ceiling |> int64
let sorted =
frame
|> Frame.sortRows column
let ranks =
Frame.getCol column frame
|> Series.map(fun k _ ->
int ((Frame.indexForKey k sorted) / chunkSize)
)
let clone = frame.Clone()
clone.AddColumn(rankName, ranks)
clone
where indexForKey is this other custom extension:
// index for row with key
// index starting at 0
static member indexForKey (key:'K) (frame:Frame<'K,_>) : int64 =
frame.RowIndex.Locate key
|> frame.RowIndex.AddressOperations.OffsetOf
I tried this other definition hoping that it would run faster. It is slightly faster, but not by a lot; any comments on performance issues are welcomed:
static member denseRank column (groups:int) rankName frame =
let frameLength = Frame.countRows frame
let chunkSize = (float frameLength) / (float groups) |> Math.Ceiling
let sorted =
frame
|> Frame.sortRows column
let sortedKeys = Frame.getRowKeys sorted
let ranksArr = Array.zeroCreate frameLength
sortedKeys
|> Seq.iteri (fun index _ -> ranksArr.[index] <- index / (int chunkSize))
let ranks = Series(sortedKeys, ranksArr)
let clone = frame.Clone()
clone.AddColumn(rankName, ranks)
clone