Search code examples
f#seq

F#: grouping by recurring sequences of elements


I have a sequence of pairs (key, value) like

[("a", 1), ("a", 2), ("a", 111), ("b", 3), ("bb", 1), ("bb", -1), ...]

, what is the most effective way to convert it into sequence like

[("a", [1,2,111]), ("b", [3]), ("bb", [1,-1])] 

or similar?

The sequence has following property: it's really big (>2Gb)

This makes Seq.groupBy really ineffective and incorrect, are there any other ways to do it?

P.S.: this sequence:

[("a", 1), ("a", 2), ("a", 111), ("bb", 1), ("bb", -1), ("a", 5), ("a", 6), ...]

should be converted as

[("a", [1,2,111]), ("bb", [1,-1]), ("a", [5,6]), ...]

--

edit #1: Fixed incorrect sample

edit #2: Sequence is big, so lazy (or fastest) solution is preferred


Solution

  • If you want the option to get lazy results, then I don't think there's an elegant way without maintaining mutable state. Here's a relatively straight-forward one with mutation. You maintain a store of the last key you saw, and all the values that correspond to that:

    let s = [("a", 1); ("a", 2); ("a", 111); ("bb", 1); ("bb", -1); ("a", 5); ("a", 6)]
    let s2 = 
        [
            let mutable prevKey = None
            let mutable values = System.Collections.Generic.List<_>()
            let init key value = 
                prevKey <- Some key
                values.Clear()
                values.Add value
            for (key, value) in s do
                match prevKey with
                | None -> init key value
                | Some k when k = key -> values.Add value
                | Some k -> 
                    yield (k, List.ofSeq values)
                    init key value
            match prevKey with
            | Some k -> yield (k, List.ofSeq values)
            | _ -> ()
        ]
    

    This gives:

    val s2 : (string * int list) list =
      [("a", [1; 2; 111]); ("bb", [1; -1]); ("a", [5; 6])]
    

    For lazy evaluation, replace the [ ... ] with seq { ... }