Problem Description:
I have a dataset of about 35 million rows and 10 columns.
I want to calculate the distance between every pair of rows, using a distance function like distance(row1, row2),
and then store the values in a huge matrix.
The total number of operations needed is nearly 6*10^15, which I think is very large.
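As a rough check of that estimate, the arithmetic can be sketched in plain Scala. The row and column counts come from the description above. The cost model (about one arithmetic operation per column per pair) and the 8-byte Double per stored distance are assumptions made only for illustration.

```scala
object PairwiseCost {
  val rows: Long = 35000000L      // row count from the question
  val columns: Long = 10L         // column count from the question

  // Number of unordered pairs (i, j) with i < j: n * (n - 1) / 2
  val pairs: Long = rows * (rows - 1) / 2

  // Assumption: roughly one arithmetic operation per column per pair
  val ops: Long = pairs * columns

  // Assumption: each distance is stored as an 8-byte Double
  val bytes: Long = pairs * 8L

  def main(args: Array[String]): Unit = {
    println(f"pairs = ${pairs.toDouble}%.3e")
    println(f"ops   = ${ops.toDouble}%.3e")
    println(f"bytes = ${bytes.toDouble}%.3e")
  }
}
```

Under these assumptions there are about 6.1*10^14 pairs and about 6.1*10^15 operations, and storing every distance would take several petabytes, so the full matrix is a storage problem as well as a compute problem.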
What I've tried:
1. Call df.collect() to get array1: Array[Row].
2. Iterate over array1 pairwise and calculate distance(rowi, rowj).
3. Store each distance in matrix(i, j).

Scala code:
val array1 = df.collect()   // pulls every row to the driver
val l = array1.length
for (i <- 0 until l) {
  for (j <- i + 1 until l) {
    // one (i, j, distance) triple per pair of rows
    val vd = Vectors.dense(i.toDouble, j.toDouble, distance(array1(i), array1(j)))
  }
}
I want to save each value in a Vector like the one above and add it to an RDD/DataFrame.
But the only way I've found is to use union, and I don't think that is good enough.
So there are three questions that need to be solved:
1. collect is an action, and df.collect() throws the exception java.lang.OutOfMemoryError: Java heap space. Can this be avoided?
2. Once I have distance(rowi, rowj), I want to store it. How?
3. PS: If none of the above can be solved, what other approach can I use?
Any answer would help me a lot. Thank you!
Check IndexedRowMatrix: https://spark.apache.org/docs/latest/mllib-data-types.html#indexedrowmatrix. An IndexedRowMatrix is similar to a RowMatrix but with meaningful row indices. You can design your algorithm based on this API.
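One possible way to build on this, as a minimal sketch rather than a tested solution: index the rows without calling collect(), pair them with cartesian, and keep each (i, j, distance) triple as a MatrixEntry in a CoordinateMatrix instead of growing an RDD with union. The object name PairwiseDistanceSketch, the helper names, and the Euclidean distance are hypothetical stand-ins for the question's own distance function, and the code assumes every column of df is a non-null Double.

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, IndexedRow, IndexedRowMatrix, MatrixEntry}
import org.apache.spark.sql.DataFrame

object PairwiseDistanceSketch {
  // Hypothetical stand-in for the question's distance(row1, row2): Euclidean distance.
  def distance(a: Vector, b: Vector): Double = math.sqrt(Vectors.sqdist(a, b))

  // Assumption: every column of df is a non-null Double.
  // zipWithIndex assigns a stable Long index to each row without collecting to the driver.
  def toIndexedRows(df: DataFrame): IndexedRowMatrix = {
    val rows = df.rdd.zipWithIndex().map { case (row, idx) =>
      val values = Array.tabulate(row.length)(i => row.getDouble(i))
      IndexedRow(idx, Vectors.dense(values))
    }
    new IndexedRowMatrix(rows)
  }

  // Pairs every row with every other row, keeps only i < j, and stores each
  // result as a sparse (i, j, distance) entry.
  def pairwise(m: IndexedRowMatrix): CoordinateMatrix = {
    val rows = m.rows
    val entries = rows.cartesian(rows)
      .filter { case (a, b) => a.index < b.index }
      .map { case (a, b) => MatrixEntry(a.index, b.index, distance(a.vector, b.vector)) }
    new CoordinateMatrix(entries)
  }
}
```

Calling PairwiseDistanceSketch.pairwise(PairwiseDistanceSketch.toIndexedRows(df)) yields a distributed matrix whose entries RDD can be written out like any other RDD. This avoids the driver-side OutOfMemoryError, but the work is still quadratic in the number of rows, so for 35 million rows it is probably only practical on a subset or after reducing the number of pairs that have to be compared.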