Tags: scala, apache-spark, extends, apache-spark-ml

Proper way to customize a Spark ML estimator (e.g. GaussianMixture) by modifying its private method?


My code uses org.apache.spark.ml.clustering.GaussianMixture, but its init method private def initRandom(...) does not work well, so I want to write a custom init method.

At first I wanted to extend the class GaussianMixture, but initRandom is a private method.

Then I tried another way: setting an initial GMM. But sadly the source code says TODO: SPARK-15785 Support users supplied initial GMM.

I've also tried copying the code of class GaussianMixture into my own custom class, but there is too much attached to it: GaussianMixture.scala comes with a number of classes and traits, some of which are only accessible within the ML packages.


Solution

  • I solved it myself. Here is my solution.

    I created a class CustomGaussianMixture which extends GaussianMixture from the official package org.apache.spark.ml.clustering.

    Within my project, I created a new package also named org.apache.spark.ml.clustering (to avoid having to deal with the many package-private classes/traits/objects in org.apache.spark.ml.clustering), and placed my custom class in it.

    The next step is to override fit, the non-private method that calls initRandom. Specifically, write the new init method in class CustomGaussianMixture, copy the method fit from the official source code in GaussianMixture.scala into CustomGaussianMixture, and modify the code in CustomGaussianMixture.fit() to call the custom init method (see the sketch below).

    Finally, just use CustomGaussianMixture instead of GaussianMixture where needed (a usage example follows below).
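
Here is a minimal skeleton of what that looks like. It only shows the structure: the custom init method (initCustom is just a hypothetical name) and the copied fit() body are elided, since the real body is Spark's own code with only the init call swapped.

```scala
// Declared in my own project, but under Spark's package name so that
// package-private helpers used by GaussianMixture.scala stay visible.
package org.apache.spark.ml.clustering

import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.Dataset

class CustomGaussianMixture(override val uid: String) extends GaussianMixture(uid) {

  def this() = this(Identifiable.randomUID("CustomGaussianMixture"))

  // Hypothetical replacement for the private initRandom; the actual logic
  // and parameters depend on how you want to seed the mixture components.
  private def initCustom(): Unit = {
    // ... custom initialization ...
  }

  // fit() is public, so it can be overridden. In the real class the body is
  // fit() copied from Spark's GaussianMixture.scala, with the call to
  // initRandom(...) replaced by a call to initCustom(...).
  override def fit(dataset: Dataset[_]): GaussianMixtureModel = {
    // ... copied fit() body, with initRandom -> initCustom ...
    super.fit(dataset) // placeholder so this sketch compiles
  }
}
```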
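
And a sketch of how it is used, exactly like the stock GaussianMixture; the data path, k, and other parameter values here are just placeholders.

```scala
import org.apache.spark.ml.clustering.CustomGaussianMixture
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("custom-gmm-demo").getOrCreate()

// Any DataFrame with a Vector "features" column works; this path is only an example.
val dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

// Drop-in replacement for GaussianMixture: same params, same model type.
val gmm = new CustomGaussianMixture()
  .setK(3)
  .setFeaturesCol("features")
  .setMaxIter(100)
  .setSeed(42L)

val model = gmm.fit(dataset)
model.gaussians.zip(model.weights).foreach { case (g, w) =>
  println(s"weight=$w mean=${g.mean}")
}
```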