Search code examples
functional-programmingsmlsmlnjmlmosml

Is it possible to create a "generic" function in Standard ML?


I would like to create a function remove_duplicates that takes a list of any type (e.g. can be an int list or a bool list or a int list list or a whatever list) and returns the same list without duplicates, is this possible in Standard ML?


Solution

  • Is a function that takes a list of any type and returns the list without duplicates possible in Standard ML?

    No.

    To determine if one element is a duplicate of another, their values must be comparable. "Any type", or 'a in Standard ML, is not comparable for equality. So while you cannot have a val nub : 'a list -> 'a list that removes duplicates, here are four alternative options:

    1. What @qouify suggests, the built-in equality type ''a, so anything you can use = on:

      val nub : ''a list -> ''a list
      
    2. What @kopecs suggests, a function that takes an equality operator as parameter:

      val nub : ('a * 'a -> bool) -> 'a list -> 'a list
      

      Which is a generalisation of 1., since here, nub op= : ''a list -> ''a list. This solution is kind of neat since it lets you remove not only duplicates, but also redundant representatives of arbitrary equivalence classes, e.g. nub (fn (x, y) => (x mod 3) = (y mod 3)) will only preserve integers that are distinct modulo 3. But its complexity is O(n²). (-_- )ノ⌒┻━┻

    3. Because it is O(n²), nub is considered harmful.

      As the article also suggests, the alternative is to use ordering rather than equality to reduce the complexity to O(n log n). While in Haskell this means only changing the type class constraint:

      nub    :: Eq a  => [a] -> [a]
      nubOrd :: Ord a => [a] -> [a]
      

      and adjusting the algorithm, it gets a little more complicated to express this constraint in SML. While we do have ''a to represent Eq a => a (that we can use = on our input), we don't have a similar special syntax support for elements that can be compared as less/equal/greater, and we also don't have type classes. We do have the following built-in order type:

      datatype order = LESS | EQUAL | GREATER
      

      so if you like kopecs' solution, a variation with a better running time is:

      val nubOrd : ('a * 'a -> order) -> 'a list -> 'a list
      

      since it can use something like a mathematical set of previously seen elements, implemented using some kind of balanced search tree; n inserts each of complexity O(log n) takes a total of O(n log n) steps.

    4. One of SML's winner features is its composable module system. Instead of using parametric polymorphism and feeding the function nubOrd with an order comparison function, you can create a module that takes another module as a parameter (a functor).

      First, let's define a signature for modules that represent ordering of types:

      signature ORD =
      sig
        type t
        val compare : t * t -> order
      end
      

      (Notice that there isn't a ' in front of t.)

      This means that anyone could make a struct ... end : ORD by specifying a t and a corresponding compare function for ts. Many built-in types have pre-defined compare functions: int has Int.compare and real has Real.compare.

      Then, define a tree-based set data structure; I've used a binary search tree, and I've skipped most functions but the ones strictly necessary to perform this feat. Ideally you might extend the interface and use a better tree type, such as a self-balancing tree. (Unfortunately, since you've tagged this Q&A both as SML/NJ and Moscow ML, I wasn't sure which module to use, since they extend the standard library in different ways when it comes to balanced trees.)

      functor TreeSet (X : ORD) =
      struct
        type t = X.t
        datatype 'a tree = Leaf | Branch of 'a tree * 'a * 'a tree
      
        val empty = Leaf
      
        fun member (x, Leaf) = false
          | member (x, Branch (left, y, right)) =
              case X.compare (x, y) of
                   EQUAL => true
                 | LESS => member (x, left)
                 | GREATER => member (x, right)
      
        fun insert (x, Leaf) = Branch (Leaf, x, Leaf)
          | insert (x, Branch (left, y, right)) =
              case X.compare (x, y) of
                   EQUAL => Branch (left, y, right)
                 | LESS  => Branch (insert (x, left), y, right)
                 | GREATER => Branch (left, y, insert (x, right))
      end
      

      Lastly, the ListUtils functor contains the nubOrd utility function. The functor takes a structure X : ORD just like the TreeSet functor does. It creates an XSet structure by specialising the TreeSet functor using the same ordering module. It then uses this XSet to efficiently keep a record of the elements it has seen before.

      functor ListUtils (X : ORD) =
      struct
        structure XSet = TreeSet(X)
      
        fun nubOrd (xs : X.t list) =
          let
            val init = ([], XSet.empty)
            fun go (x, (ys, seen)) =
              if XSet.member (x, seen)
                then (ys, seen)
                else (x::ys, XSet.insert (x, seen))
          in rev (#1 (foldl go init xs))
          end
      end
      

      Using this functor to remove duplicates in an int list:

      structure IntListUtils = ListUtils(struct
                                           type t = int
                                           val compare = Int.compare
                                         end)
      
      val example = IntListUtils.nubOrd [1,1,2,1,3,1,2,1,3,3,2,1,4,3,2,1,5,4,3,2,1]
                                     (* [1,  2,  3,              4,      5] *)
      

      The purpose of all that mess is a nubOrd without a direct extra function parameter.

      Unfortunately, in order for this to extend to int list list, you need to create the compare function for that type, since unlike Int.compare, there isn't a generic one available in the standard library either. (This is where Haskell is a lot more ergonomic.)

      So you might go and write a generic, lexicographical list compare function: If you know how to compare two elements of type 'a, you know how to compare two lists of those, no matter what the element type is:

      fun listCompare _ ([], []) = EQUAL   (* empty lists are equal *)
        | listCompare _ ([], ys) = LESS    (* empty is always smaller than non-empty *)
        | listCompare _ (xs, []) = GREATER (* empty is always smaller than non-empty *)
        | listCompare compare (x::xs, y::ys) =
            case compare (x, y) of
                 EQUAL => listCompare compare (xs, ys)
               | LESS => LESS
               | GREATER => GREATER
      

      And now,

      structure IntListListUtils = ListUtils(struct
                                               type t = int list
                                               val compare = listCompare Int.compare
                                             end)
      val example2 = IntListListUtils.nubOrd [[1,2,3],[1,2,3,2],[1,2,3]]
                                          (* [[1,2,3],[1,2,3,2]] *)
      

      So even though [1,2,3] and [1,2,3,2] contain duplicates, they are not EQUAL when you compare them. But the third element is EQUAL to the first one, and so it gets removed as a duplicate.

    Some last observations:

    • You may consider that even though each compare is only run O(log n) times, a single compare for some complex data structure, such as a (whatever * int) list list may still be expensive. So another improvement you can make here is to cache the result of every compare output, which is actually what Haskell's nubOrdOn operator does. ┳━┳ ヽ(ಠل͜ಠ)ノ

    • The functor approach is used extensively in Jane Street's OCaml Base library. The quick solution was to pass around an 'a * 'a -> order function around every single time you nub something. One moral, though, is that while the module system does add verbosity, if you provide enough of this machinery in a standard library, it will become quite convenient.

    • If you think the improvement from O(n²) to O(n log n) is not enough, consider Fritz Henglein's Generic top-down discrimination for sorting and partitioning in linear time (2012) and Edward Kmett's Haskell discrimination package's nub for a O(n) nub.