
Why does Spark MLlib HashingTF output only 1D Vectors?


So I have this big DataFrame with the following schema:

dataframe: org.apache.spark.sql.DataFrame = [id: string, data: string]

Data is a very large set of words/identifiers. It also contains unnecessary symbols like ["{ etc., which I need to clean up.

My solution for this cleanup is:

import org.apache.spark.sql.Row

val dataframe2 = sqlContext.createDataFrame(
  dataframe.map(x => Row(x.getString(0),
    x.getAs[String](1).replaceAll("[^a-zA-Z,_:]", ""))),
  dataframe.schema)

I need to apply ML to this data, so it goes through a pipeline like this (a sketch of the full pipeline follows the steps below):

  1. Tokenizer, which produces

org.apache.spark.sql.DataFrame = [id: string, data: string, tokenized_data: array<string>]

with output (without the data column)

[id1,WrappedArray(ab,abc,nuj,bzu...)]

  2. StopWordsRemover

org.apache.spark.sql.DataFrame = [id: string, data: string, tokenized_data: array<string>, newData: array<string>]

with output (without data and tokenized_data)

[id1,WrappedArray(ab,abc,nuj,bzu...)]

  3. HashingTF

org.apache.spark.sql.DataFrame = [id: string, data: string, tokenized_data: array<string>, newData: array<string>, hashedData: vector]

and the resulting vectors look like this:

[id1,(262144,[236355],[1.0])]
[id2,(262144,[152325],[1.0])]
[id3,(262144,[27653],[1.0])]
[id4,(262144,[199400],[1.0])]
[id5,(262144,[82931],[1.0])]
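
For reference, here is a minimal sketch of that pipeline, assuming the spark.ml API and the column names from the schemas above (the 262144 bucket count matches HashingTF's default of 2^18 features):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{HashingTF, StopWordsRemover, Tokenizer}

// Stage 1: split the data column into tokens
val tokenizer = new Tokenizer()
  .setInputCol("data")
  .setOutputCol("tokenized_data")

// Stage 2: remove common stop words
val remover = new StopWordsRemover()
  .setInputCol("tokenized_data")
  .setOutputCol("newData")

// Stage 3: hash tokens into a fixed-size term-frequency vector
// (default numFeatures is 2^18 = 262144, matching the output above)
val hashingTF = new HashingTF()
  .setInputCol("newData")
  .setOutputCol("hashedData")

val pipeline = new Pipeline().setStages(Array(tokenizer, remover, hashingTF))
val hashed = pipeline.fit(dataframe2).transform(dataframe2)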

Each of the arrays created by the previous stages can contain anywhere from zero up to dozens of features. Yet virtually all of my vectors end up with a single non-zero entry. I want to do some clustering with this data, but one feature per vector is a big problem. Why is this happening, and how can I fix it?

I figured out that the problem occurs precisely when I clean up the data. If I skip the cleanup, HashingTF performs normally. What am I doing wrong in the cleanup, and how can I perform a similar cleanup without breaking the format?


Solution

  • [^a-zA-Z,_:] also matches all whitespace, so replaceAll strips it out. Each document collapses into one continuous string, which the Tokenizer turns into a single token and HashingTF into a vector with a single non-zero entry. Exclude whitespace from the character class, or use a RegexTokenizer as a replacement, as sketched below.
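
Both fixes in a minimal sketch (the \s addition to the character class and the RegexTokenizer pattern are illustrative choices, not the only valid ones):

import org.apache.spark.ml.feature.RegexTokenizer
import org.apache.spark.sql.Row

// Option 1: keep whitespace during cleanup so tokens stay separated.
// Adding \s to the negated class preserves spaces and newlines.
val dataframe2 = sqlContext.createDataFrame(
  dataframe.map(x => Row(x.getString(0),
    x.getAs[String](1).replaceAll("[^a-zA-Z,_:\\s]", ""))),
  dataframe.schema)

// Option 2: skip replaceAll entirely and let a RegexTokenizer pull
// word-like tokens straight out of the raw data column.
val regexTokenizer = new RegexTokenizer()
  .setInputCol("data")
  .setOutputCol("tokenized_data")
  .setPattern("[a-zA-Z,_:]+") // match the tokens themselves ...
  .setGaps(false)             // ... rather than splitting on gaps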