
pyspark.ml random forest model feature importances result empty?


I am training a RandomForestClassifier in pyspark.ml, and when I try to get the feature importances of the trained model via its featureImportances attribute, I see nothing in the returned value for the feature indices or importance weights:

(37,[],[])

I'd expect something like...

(37,[<feature indices>],[<feature importance weights>])

...and certainly not completely blank. It is odd because it appears to recognize that there are 37 features, yet the other two lists are empty. Nothing in the docs seems to address this.

What could be going on here?


Solution

  • TLDR: A sparse vector is printed in a particular way that lists only its non-zero entries. If yours prints with empty lists, all of its values are zero.

    Checking the type of the featureImportances attribute of the fitted RandomForestClassificationModel (the Transformer) shows that it is a SparseVector. A printed sparse vector generally looks like...

    (<size>, <list of non-zero indices>, <list of non-zero values associated with the indices>)
    

    ...(if anyone has a link to documentation confirming that this is how to read a printed sparse vector, do let me know, because I can't remember where I learned this or where it can be confirmed).

    An example of how SparseVectors are printed is shown below:

    # pyspark.ml.linalg.SparseVector (Spark 2.0+) prints the same way
    from pyspark.mllib.linalg import SparseVector
    import pprint

    # No non-zero entries: both lists print empty
    a = SparseVector(5, {})
    print(a)
    # (5,[],[])
    pprint.pprint(a)
    # SparseVector(5, {})
    pprint.pprint(a.toArray())
    # array([0., 0., 0., 0., 0.])

    # Non-zero entries at indices 0, 2 and 4
    b = SparseVector(5, {0: 1, 2: 3, 4: 5})
    print(b)
    # (5,[0,2,4],[1.0,3.0,5.0])
    pprint.pprint(b)
    # SparseVector(5, {0: 1.0, 2: 3.0, 4: 5.0})
    pprint.pprint(b.toArray())
    # array([1., 0., 3., 0., 5.])
    

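    So if featureImportances comes back as (<size>, [], []), every feature has an importance of exactly zero. As far as I can tell, that means none of the trees in the forest split on any feature (each tree is effectively a single leaf), so no feature ever reduced impurity. That can happen when the chosen features carry little signal for the label (sadly, your/my features are not very good, at least from the model's point of view), or when settings such as maxDepth, minInfoGain, or minInstancesPerNode prevent any split. Either way, more data analysis is in order.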
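
To make the output easier to read than the raw (size, indices, values) form, the sparse vector can be expanded into a name-to-weight mapping. The following is only a minimal sketch: importances_by_name is a hypothetical helper (not part of PySpark), the names list is assumed to be your feature columns in the order they were assembled into the feature vector, and the SimpleNamespace objects are stand-ins that mimic the size, indices and values attributes of a SparseVector so the example runs without Spark.

```python
from types import SimpleNamespace


def importances_by_name(vec, names):
    """Expand a sparse importance vector into {feature_name: weight}.

    `vec` is any object exposing .size, .indices and .values, as a
    SparseVector does. `names` is an assumed list of feature names in
    vector order; it is not something the model provides by itself.
    """
    if len(names) != vec.size:
        raise ValueError("expected %d names, got %d" % (vec.size, len(names)))
    # Indices that are not stored in the sparse vector are zero.
    weights = dict.fromkeys(names, 0.0)
    for i, w in zip(vec.indices, vec.values):
        weights[names[int(i)]] = float(w)
    return weights


# Stand-ins shaped like SparseVector(3, {}) and SparseVector(3, {0: 0.25, 2: 0.75})
empty = SimpleNamespace(size=3, indices=[], values=[])
some = SimpleNamespace(size=3, indices=[0, 2], values=[0.25, 0.75])

print(importances_by_name(empty, ["a", "b", "c"]))
# {'a': 0.0, 'b': 0.0, 'c': 0.0}
print(importances_by_name(some, ["a", "b", "c"]))
# {'a': 0.25, 'b': 0.0, 'c': 0.75}
```

With a real model you would pass model.featureImportances as vec, assuming it is a SparseVector as in the question. An all-zero result like the first one corresponds to the (37,[],[]) output above.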