Tags: apache-spark, machine-learning, pyspark, logistic-regression, gradient-descent

How to correctly get the weights using Spark for a synthetic dataset?


I'm running LogisticRegressionWithSGD on Spark for a synthetic dataset. I've calculated the error with vanilla gradient descent in MATLAB and in R, and it's ~5%. I also recovered weights similar to the ones used in the model that generated y. The dataset was generated using this example.
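For reference, here is a minimal sketch of how a dataset of this shape might be generated (not necessarily the linked example; the generating weights [2, 3, 4, 2, 1, ...] and intercept 1 are taken from below, while the sample size, seed, and variable names are assumptions):

    import numpy as np

    # Generating weights and intercept taken from the question; 15 features total.
    true_w = np.array([2, 3, 4, 2, 1] * 3, dtype=float)
    true_b = 1.0
    n = 10000  # assumed sample size

    np.random.seed(42)
    X = np.random.randn(n, len(true_w))
    p = 1.0 / (1.0 + np.exp(-(X.dot(true_w) + true_b)))  # logistic link
    y = (np.random.rand(n) < p).astype(float)            # Bernoulli labels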

While I'm able to get a very close error rate at the end with different stepSize tuning, the weights for the individual features aren't the same; in fact, they vary a lot. I tried LBFGS on Spark, and it predicts both the error and the weights correctly in a few iterations. My problem is with logistic regression with SGD on Spark.
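For context, this is roughly what the LBFGS run looks like in PySpark MLlib (a sketch, assuming the X and y arrays from the data-generation snippet above; the iteration count is a placeholder):

    from pyspark import SparkContext
    from pyspark.mllib.classification import LogisticRegressionWithLBFGS
    from pyspark.mllib.regression import LabeledPoint

    sc = SparkContext(appName="synthetic-logreg")

    # Build an RDD of LabeledPoint from the synthetic arrays X and y above.
    points = sc.parallelize(
        [LabeledPoint(label, features) for label, features in zip(y, X)]
    ).cache()

    # LBFGS recovers the generating weights and intercept in a few iterations.
    lbfgs_model = LogisticRegressionWithLBFGS.train(points, iterations=100,
                                                    intercept=True)
    print(lbfgs_model.weights)    # close to [2, 3, 4, 2, 1, ...]
    print(lbfgs_model.intercept)  # close to 1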

The weights I'm getting:

[0.466521045342,0.699614292387,0.932673108363,0.464446310304,0.231458578991,0.464372487994,0.700369689073,0.928407671516,0.467131704168,0.231629845549,0.46465456877,0.700207596219,0.935570594833,0.465697758292,0.230127949916]

The weights I want:

[2,3,4,2,1,2,3,4,2,1,2,3,4,2,1]

Intercept I'm getting: 0.2638102010832128
Intercept I want: 1

Q.1. Is it a problem with the synthetic dataset? I've tried tuning miniBatchFraction, stepSize, the number of iterations, and the intercept, but I couldn't get it right.
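For completeness, this is the kind of tuning run I mean (a sketch reusing the points RDD from above; the values are placeholders, and note that in PySpark the step size is the step argument and the mini-batch fraction is miniBatchFraction):

    from pyspark.mllib.classification import LogisticRegressionWithSGD

    # Reuses the `points` RDD from the earlier sketch; the values below are
    # placeholders, not the exact settings I tried.
    sgd_model = LogisticRegressionWithSGD.train(
        points,
        iterations=1000,        # number of SGD iterations
        step=0.1,               # called stepSize in the Scala API
        miniBatchFraction=1.0,  # fraction of the data sampled per iteration
        intercept=True,
    )
    print(sgd_model.weights)
    print(sgd_model.intercept)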

Q.2. Why is Spark giving me these weird weights? Is it wrong to expect similar weights from Spark's model?

Please let me know if extra details are needed to answer my question.


Solution

  • It actually did converge: your weights come out scaled to the 0–1 range, while the expected maximum value is 4. Multiply everything you got from SGD by 4 and you can see the correlation, even for the intercept value.
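A quick check on the reported numbers (only the first five weights are shown; the factor of 4 is the expected maximum weight mentioned above):

    # First five weights reported in the question, rescaled by 4 as suggested.
    sgd_weights = [0.466521045342, 0.699614292387, 0.932673108363,
                   0.464446310304, 0.231458578991]
    print([round(w * 4, 2) for w in sgd_weights])  # ~[1.87, 2.8, 3.73, 1.86, 0.93] vs. [2, 3, 4, 2, 1]
    print(round(0.2638102010832128 * 4, 2))        # ~1.06 vs. the expected intercept of 1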