I have a file (tags.csv) that contains UserId, MovieId, and tags. I want to use a domain-based method to calculate the cosine similarity between tags. I want to show only the tags relevant to comedy and measure the similarity of each of those tags to the comedy tag.
My code is:
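import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, RowMatrix}

val rows = sc.textFile("/usr/local/comedy")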
val vecData = rows.map(line => Vectors.dense(line.split(", ").map(_.toDouble)))
val mat = new RowMatrix(vecData)
val exact = mat.columnSimilarities()
val approx = mat.columnSimilarities(0.07)
val exactEntries = exact.entries.map { case MatrixEntry(i, j, u) => ((i, j), u) }
val approxEntries = approx.entries.map { case MatrixEntry(i, j, v) => ((i, j), v) }
val MAE = exactEntries.leftOuterJoin(approxEntries).values.map {
  case (u, Some(v)) =>
    math.abs(u - v)
  case (u, None) =>
    math.abs(u)
}.mean()
but this error appears:
java.lang.NumberFormatException: For input string: "[1,898,"black comedy"]"
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
at java.lang.Double.parseDouble(Double.java:538)
What's wrong?
The error message is full of pertinent info.
NumberFormatException: For input string: "[1,898,"black comedy"]"
It looks like the input String isn't being split into separate column data. So .split(", ") isn't doing its job, and it's easy to see why: there are no comma-space sequences to split on.
We could take out the space and split on just the comma, but that would still leave a non-digit [ in the 1st column data, and the 3rd column data has no digit characters at all.
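To make the failure concrete, here is a minimal sketch in plain Scala (no Spark needed) that runs both splits on the line quoted in the error message. The object name SplitDemo and the helper parses are made up for illustration and are not part of the original code.

```scala
object SplitDemo {
  // Sample line copied from the NumberFormatException message in the question.
  val line = "[1,898,\"black comedy\"]"

  // The line contains no ", " sequence, so the whole line comes back as a single element.
  val bySpace: Array[String] = line.split(", ")

  // Splitting on a bare comma yields three fields: [1   898   "black comedy"]
  val byComma: Array[String] = line.split(",")

  // Hypothetical helper: true when the field can be converted with toDouble.
  def parses(s: String): Boolean = scala.util.Try(s.toDouble).isSuccess
}
```

With the comma-only split, only the middle field converts to a Double; the first still carries the leading bracket and the third is a quoted tag, so .map(_.toDouble) would still throw.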
There are a few different ways to attack this. I'd be tempted to use a regex parser.
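// Match two comma-terminated digit runs anywhere in the line (UserId, MovieId).
val twoNums = "(\\d+),(\\d+),".r.unanchored
// collect with a partial function keeps only the lines that match the pattern.
val vecData = rows.collect { case twoNums(a, b) =>
  Vectors.dense(Array(a.toDouble, b.toDouble))
}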
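The same pattern can be tried without a Spark context. The sketch below applies it to an ordinary List, assuming the lines look like the one in the error message; the second and third sample lines and the object name RegexParseDemo are invented for illustration. List.collect takes a partial function much like the RDD version, so non-matching lines are dropped rather than raising an exception.

```scala
object RegexParseDemo {
  // Same pattern as above: two comma-terminated runs of digits, found anywhere in the line.
  val twoNums = "(\\d+),(\\d+),".r.unanchored

  // Hypothetical input; only the first line is taken from the question's error message.
  val lines = List(
    "[1,898,\"black comedy\"]",
    "[2,1200,\"dark comedy\"]",
    "UserId,MovieId,tags" // header-like line with no digits, so it does not match
  )

  // Keep only the lines the pattern matches and convert both captures to Double.
  val parsed: List[Array[Double]] = lines.collect {
    case twoNums(a, b) => Array(a.toDouble, b.toDouble)
  }
}
```

Note that this only recovers the two numeric columns. The tag text itself is discarded, so it would need a separate numeric encoding before it could take part in a similarity computation.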