I have a large DataFrame with this schema:
dataframe: org.apache.spark.sql.DataFrame = [id: string, data: string]
The data column holds a very large set of words/identifiers. It also contains unnecessary symbols such as ", [, { etc., which I need to clean up.
My solution for this clean-up is:
val dataframe2 = sqlContext.createDataFrame(dataframe.map(x=> Row(x.getString(0), x.getAs[String](1).replaceAll("[^a-zA-Z,_:]",""))), dataframe.schema)
I need to apply ML to this data, so it goes through a pipeline like this:
org.apache.spark.sql.DataFrame = [id: string, data: string, tokenized_data: array<string>]
with output (shown without the data column):
[id1,WrappedArray(ab,abc,nuj,bzu...)]
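The tokenized_data column comes from a tokenizer stage; a minimal sketch of that step, assuming the standard spark.ml Tokenizer and the column names shown above:

import org.apache.spark.ml.feature.Tokenizer

// split the cleaned data column into an array of tokens
val tokenizer = new Tokenizer()
  .setInputCol("data")
  .setOutputCol("tokenized_data")
val tokenized = tokenizer.transform(dataframe2)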
org.apache.spark.sql.DataFrame = [id: string, data: string, tokenized_data: array<string>, newData: array<string>]
with output (shown without the data and tokenized_data columns):
[id1,WrappedArray(ab,abc,nuj,bzu...)]
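The newData column comes from a further feature transformer (the question does not say which one; a StopWordsRemover is just one possibility). Purely as a sketch, such a stage would be wired up like this, with the column names assumed from the schema above:

import org.apache.spark.ml.feature.StopWordsRemover

// filter the token array into a second array column
val remover = new StopWordsRemover()
  .setInputCol("tokenized_data")
  .setOutputCol("newData")
val filtered = remover.transform(tokenized)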
org.apache.spark.sql.DataFrame = [id: string, data: string, tokenized_data: array<string>, newData: array<string>, hashedData: vector]
and the resulting vectors look like this:
[id1,(262144,[236355],[1.0])]
[id2,(262144,[152325],[1.0])]
[id3,(262144,[27653],[1.0])]
[id4,(262144,[199400],[1.0])]
[id5,(262144,[82931],[1.0])]
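The hashing step is consistent with the spark.ml HashingTF at its default 2^18 = 262144 features; a sketch of that stage (column names assumed from the schema above):

import org.apache.spark.ml.feature.HashingTF

// hash the token array into a sparse feature vector
val hashingTF = new HashingTF()
  .setInputCol("newData")
  .setOutputCol("hashedData")   // numFeatures defaults to 262144 (2^18)
val hashed = hashingTF.transform(filtered)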
Each of the arrays produced by the previous stages can contain anywhere from zero to dozens of features, and yet virtually all of my vectors end up with just one non-zero entry. I want to do some clustering with this data, but the effectively one-dimensional vectors are a big problem. Why is this happening and how can I fix it?
I have figured out that the problem appears precisely when I clean up the data. If I skip the clean-up, HashingTF behaves normally. What am I doing wrong in the clean-up, and how can I perform a similar clean-up without breaking the format?
[^a-zA-Z,_:]
matches all whitespace characters as well. Removing them collapses each record into a single continuous string, which, when tokenized, yields a single token and therefore a Vector with only one non-zero entry. You should either exclude whitespace from the character class or use a RegexTokenizer instead.
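For example, either keep whitespace out of the removal class, or drop the manual replaceAll and let a RegexTokenizer extract the tokens directly. A sketch of both options (the exact token pattern is an assumption and should be adapted to your identifiers):

import org.apache.spark.sql.Row

// Option 1: add \s to the negated class so whitespace survives the clean-up
val dataframe2 = sqlContext.createDataFrame(
  dataframe.map(x => Row(
    x.getString(0),
    x.getAs[String](1).replaceAll("[^a-zA-Z,_:\\s]", ""))),
  dataframe.schema)

// Option 2: skip the manual clean-up and tokenize the raw column directly
import org.apache.spark.ml.feature.RegexTokenizer

val regexTokenizer = new RegexTokenizer()
  .setInputCol("data")
  .setOutputCol("tokenized_data")
  .setGaps(false)                     // pattern matches tokens, not separators
  .setPattern("[a-zA-Z][a-zA-Z_:]*")  // assumed token shape; adjust as needed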