Search code examples
f#setsampling

Sampling in F# : is Set adequate?


I have an array of items, from which I'd like to sample.

I was under the impression that a Set would the a good structure to sample from, in a fold where I'd give back the original or a modified set with the retrieved element missing depending if I want replacement of not. However, there seems to no method to retrieve an element directly from a Set.

Is there something I am missing ? or should I use Set of indices, along with a surrogate function that starts at some random position < Set.count and goes up until it finds a member ?

That is, something along this line

module Seq =
    let modulo (n:int) start = 
        let rec next i = seq { yield (i + 1)%n ; yield! next (i+1)}
        next start

module Array =
    let Sample (withReplacement:bool) seed (entries:'T array) = 
        let prng, indexes = new Random(seed), Set(Seq.init (entries |> Array.length) id)
        Seq.unfold (fun set  -> let N = set |> Set.count
                                let next = Seq.modulo N (prng.Next(N)) |> Seq.truncate N |> Seq.tryFind(fun i -> set |> Set.exists ((=) i))
                                if next.IsSome then
                                    Some(entries.[next.Value], if withReplacement then set else Set.remove next.Value set)
                                else
                                    None)

Edit : Tracking positively what I gave, instead of tracking what I still can give would make it simpler and more efficient.


Solution

  • For sampling without replacement, you could just permute the source seq and take however many elements you want to sample

    let sampleWithoutReplacement n s =
        let a = Array.ofSeq s
        seq { for i = a.Length downto 1 do
                  let j = rnd.Next i
                  yield a.[j]
                  a.[j] <- a.[i - 1] }
        |> Seq.take n
    

    To sample with replacement, just pick a random element n times from the source seq

    let sampleWithReplacement n s =
        let a = Array.ofSeq s
        Seq.init n (fun _ -> a.[rnd.Next(a.Length)])
    

    These may not be the most efficient methods with huge data sets however