
GLM with Apache Spark 2.2.0 - Tweedie family default Link value


I am using Spark 2.2.0 with Python. I am trying to find out which default value of the link parameter GeneralizedLinearRegression uses for the Tweedie family.

The documentation at https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.ml.regression.GeneralizedLinearRegression gives this signature:

class pyspark.ml.regression.GeneralizedLinearRegression(self, labelCol="label", featuresCol="features", predictionCol="prediction", family="gaussian", link=None, fitIntercept=True, maxIter=25, tol=1e-6, regParam=0.0, weightCol=None, solver="irls", linkPredictionCol=None

It seems that the default value when family='tweedie' should be None, but when I tried this (using a test similar to the unit test at https://github.com/apache/spark/pull/17146/files/fe1d3ae36314e385990f024bca94ab1e416476f2):

from pyspark.ml.linalg import Vectors
from pyspark.ml.regression import GeneralizedLinearRegression

# spark is an existing SparkSession
df = spark.createDataFrame(
    [(1.0, Vectors.dense(0.0, 0.0)),
     (1.0, Vectors.dense(1.0, 2.0)),
     (2.0, Vectors.dense(0.0, 0.0)),
     (2.0, Vectors.dense(1.0, 1.0))],
    ["label", "features"])
# link=None is passed explicitly here
glr = GeneralizedLinearRegression(family="tweedie", variancePower=1.42, link=None)
model = glr.fit(df)
transformed = model.transform(df)

it raised a Java NullPointerException:

Py4JJavaError: An error occurred while calling o6739.w. : java.lang.NullPointerException ...

It works fine when I remove the explicit link=None from the model initialization:

from pyspark.ml.linalg import Vectors
from pyspark.ml.regression import GeneralizedLinearRegression

# spark is an existing SparkSession
df = spark.createDataFrame(
    [(1.0, Vectors.dense(0.0, 0.0)),
     (1.0, Vectors.dense(1.0, 2.0)),
     (2.0, Vectors.dense(0.0, 0.0)),
     (2.0, Vectors.dense(1.0, 1.0))],
    ["label", "features"])
# link is left out entirely here
glr = GeneralizedLinearRegression(family="tweedie", variancePower=1.42)
model = glr.fit(df)
transformed = model.transform(df)

I would like to be able to pass a standard set of params like

params={"family":"Onefamily","link":"OnelinkAccordingToFamily",..}

and then initialize GLM as:

glr = GeneralizedLinearRegression(family=params["family"], link=params["link"], ...)

That way the initialization would be uniform and work for any combination of family and link. The problem is that the link value is not ignored when family is tweedie. Any idea what default value I should use? I tried link='' and link='None', but both raise 'invalid link function'.


Solution

  • For the Tweedie family in GeneralizedLinearRegression, the link is a power link function specified through the "linkPower" parameter rather than "link". You shouldn't set link to None explicitly; that is what led to the exception you got.

    Here is an example of how to use it:

    from pyspark.ml.linalg import Vectors
    from pyspark.ml.regression import GeneralizedLinearRegression

    df = spark.createDataFrame(
            [(1.0, Vectors.dense(0.0, 0.0)),
             (1.0, Vectors.dense(1.0, 2.0)),
             (2.0, Vectors.dense(0.0, 0.0)),
             (2.0, Vectors.dense(1.0, 1.0))], ["label", "features"])

    # neither link nor linkPower is set, so the default link power applies
    glr = GeneralizedLinearRegression(family="tweedie", variancePower=1.6)
    model = glr.fit(df)

    # set the link power explicitly (-1.0 corresponds to the inverse link)
    model2 = glr.setLinkPower(-1.0).fit(df)
    

    PS: The default link power for the Tweedie family is 1 - variancePower, so with variancePower=1.6 as above it is -0.6.
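
For the standard params dictionary described in the question, one option is to drop the None entries before building the estimator, so that link is simply left out for the Tweedie family, which is the case the question reports as working. This is only a minimal sketch: the helper name without_none and the two parameter sets are hypothetical and not part of Spark.

```python
def without_none(params):
    """Return a copy of params with every None-valued entry removed."""
    return {key: value for key, value in params.items() if value is not None}


# Hypothetical standard parameter sets, one per family.
tweedie_params = {"family": "tweedie", "link": None, "variancePower": 1.42}
poisson_params = {"family": "poisson", "link": "log"}

# "link" disappears for tweedie and is kept unchanged for poisson.
tweedie_kwargs = without_none(tweedie_params)
poisson_kwargs = without_none(poisson_params)

# With pyspark installed, the estimator could then be built as:
#   from pyspark.ml.regression import GeneralizedLinearRegression
#   glr = GeneralizedLinearRegression(**tweedie_kwargs)
```

The same call then works for every family: families with a named link pass it through, and the Tweedie entry falls back to the default link power unless a linkPower key is added to the dictionary.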