Search code examples
scaladata-structuresfunctional-programmingcase-classbloom-filter

How to model a bloom filter in Scala


I am trying to model a bloom filter in Scala. The logic itself is actually pretty straightforward, but I'm struggling to figure out how to use Scala's data structures adequately to make nice, idiomatic and functional.

My problem is this: if I use a case class, I need the constructor to generate the hash functions and the bits array that will store the actual bloom filter's data. But then, in a method like "add" that will change the contents of the bits array, I need to return a new bloom filter instead of mutating the contents of the existing one in order for my method to be referentially transparent.

Unfortunately I can't construct a new bloom filter, because I don't want the new one to re-create a new bits array and new hash functions, and I also can't pass it the existing ones because neither the bits array nor the hash functions are part of the bloom filter case class.

So how am I supposed to model this in Scala?


Solution

  • [ Modified to use BitSet, following comment ]

    This is an outline of how it might work.

    trait HashFunctions[T] {
      def apply(value: T): BitSet
    }
    
    object Bloom {
      class BloomFactory[T](hash: HashFunctions[T]) {
        case class Bloom(flags: BitSet) {
          def add(value: T): Bloom =
            Bloom(flags union hash(value))
          def test(value: T): Boolean =
            hash(value).subsetOf(flags)
        }
      }
    
      def apply[T](): BloomFactory[T]#Bloom = new BloomFactory(DefaultHashFunctions[T]).Bloom(BitSet.empty)
    }
    

    Note that this does create a new Bloom each time you add a value, but this makes the class immutable which is a good idea. The hash functions are created in the companion object so that this does not happen each time you add to the filter.

    Clearly this can be made significantly more efficient in both speed and memory usage.