Search code examples
haskelldata-structuresdisjoint-setsunion-find

Do I need to take explicit actions to facilitate sharing with persistent data structures?


I come from an imperative background and am trying to implement a simple disjoint sets (“union-find”) data structure to get some practice with creating and modifying (persistent) data structures in Haskell. The goal is to have a simple implementation, but I am also concerned about efficiency, and my question is related to this.

First, I created a disjoint-set forest implementation with union by rank and started by defining a data type for a “point”:

data Point = Point
  { _value  :: Int
  , _parent :: Maybe Point
  , _rank   :: Int
  } deriving Show

A disjointed set forest is an IntMap with Int → Point mappings:

type DSForest = IntMap Point

empty :: DSForest
empty = I.empty

A singleton set is simply a mapping from its value x to a Point with value x, no parent and a rank of 1:

makeSet :: DSForest -> Int -> DSForest
makeSet dsf x = I.insert x (Point x Nothing 0) dsf

Now, the interesting part – union. This operation will modify a point by setting the other point as its parent (and in some cases change its rank). In the case where the Points' rank are different, the Point is simply “updated” (a new Point is created) to have its parent point to the other. In the case where they are equal, a new Point is created with its rank increased by one:

union :: DSForest -> Int -> Int -> DSForest
union dsf x y | x == y = dsf 
union dsf x y = 
   if _value x' == _value y'
      then dsf 
      else case compare (_rank x') (_rank y') of
                  GT -> I.insert (_value y') y'{ _parent = Just x' } dsf 
                  LT -> I.insert (_value x') x'{ _parent = Just y' } dsf 
                            -- 1) increase x's rank by one:
                  EQ -> let x''  = x'{ _rank = _rank x' + 1 } 
                            -- 2) update the value for x's rank to point to the new x:
                            dsf' = I.insert (_value x'') x'' dsf 
                            -- 3) then update y to have the new x as its parent:
                         in I.insert (_value y') y'{ _parent = Just x'' } dsf'
  where x' = dsf ! findSet dsf x
        y' = dsf ! findSet dsf y

Now, to my real question, if in the EQ case I had instead done the following:

EQ -> let dsf' = I.insert (_value x') x'{ _rank = _rank x' + 1} dsf 
       in I.insert (_value y') y'{ _parent = Just x'{ _rank = _rank x' + 1 }} dsf'

I.e. first insert a new Point x with its rank increased, and then having y''s parent be a new Point x with its rank increased, would this mean that they no longer point to the same Point in memory? (Does this even matter? Should I worry about these things when using/creating persistent data structures?)

And just for completeness, here is findSet:

findSet :: DSForest -> Int -> Int
findSet dsf' x' = case _parent (dsf' ! x') of
                     Just (Point v _ _)  -> findSet dsf' v
                     Nothing             -> x'

(General comments about the efficiency and design of this code are also welcome.)


Solution

  • Sharing is a compiler thing. When it recognizes common sub-expressions, a compiler may chose to represent them both by the same object in memory. But even if you use such a compiler switch (like -fno-cse), it is under no obligation to do so, and the two might be (and usually are, in the absence of the switch) represented by two different, though of equal value, objects in memory. Re: referential transparency.

    OTOH when we name something and use that name twice, we (reasonably) expect it to represent the same object in memory. But compiler might choose to duplicate it and use two separate copies in two different use sites, although it is not known to do so. But it might. Re: referential transparency.

    See also:


    Here's few examples with list-producing functions, drawing from the last link above. They rely on the compiler not duplicating anything, i.e. indeed sharing any named object as expected from call by need lambda calculus operational semantics (as explained by nponeccop in the comments), and not introducing any extra sharing on its own to eliminate common subexpressions:

    1. Sharing fixpoint combinator, creating a loop:

      fix f = x where x = f x

    2. Non-sharing fixpoint combinator, creating telescoping multistage chain (i.e. regular recursion chain)

      _Y f = f (_Y f)

    3. Two-stages combination - a loop and a feed

      _2 f = f (fix f)