I am trying to multiply each matrix row by its corresponding element in a given vector using pyspark 2.2.0.
For example, in numpy I can do this as follows:
import numpy as np

foo = np.array([[1, 2, 3], [4, 5, 6]])
bar = np.array([[2], [3]])
bar * foo
Results in:
array([[ 2,  4,  6],
       [12, 15, 18]])
Note that I don't want to do a dot product. It's simply multiplying every element in a matrix row by the corresponding element in a vector.
Is there some way of doing this in pyspark 2.2.0? I have tried multiple things but couldn't quite get what I wanted. I guess one could do it with a map, along the lines of the sketch below, but that somehow feels wrong.
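Roughly what I have in mind with a map (just a sketch; it assumes foo and bar from the numpy example above, and that both RDDs end up with the same partitioning, which zip requires):

foo_rdd = sc.parallelize(foo.tolist())
bar_rdd = sc.parallelize(bar.tolist())
# zip pairs the rows positionally; each row is then scaled by its factor
foo_rdd.zip(bar_rdd).map(lambda t: [x * t[1][0] for x in t[0]]).collect()
# [[2, 4, 6], [12, 15, 18]]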
Is there some better way?
You can do this by joining the two dataframes on a row index, for instance, and then using a UDF to multiply each element of the ArrayType column by the IntegerType value:
First, let's create dataframes with a row index; zipWithIndex pairs each row with its position, and toDF turns the resulting (row, index) tuples into columns _1 and _2:

foo_df = sc.parallelize(foo.tolist()).zipWithIndex().toDF()
bar_df = sc.parallelize(bar.tolist()).zipWithIndex().toDF()
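For reference, foo_df should then look something like this:

foo_df.show()

+---------+--+
|       _1|_2|
+---------+--+
|[1, 2, 3]| 0|
|[4, 5, 6]| 1|
+---------+--+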
Now to join them and get the final result:
import pyspark.sql.functions as psf
from pyspark.sql.types import ArrayType, IntegerType

# UDF multiplying every element of an array column by an integer
mul = psf.udf(lambda xx, y: [x * y for x in xx], ArrayType(IntegerType()))

foo_df.join(bar_df, '_2')\
    .select(mul(foo_df._1, bar_df._1[0]))\
    .show()
+-------------------+
|<lambda>(_1, _1[0])|
+-------------------+
| [2, 4, 6]|
| [12, 15, 18]|
+-------------------+
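The result column gets named after the lambda; if you want a cleaner column name, you can alias it (alias is standard Spark API):

foo_df.join(bar_df, '_2')\
    .select(mul(foo_df._1, bar_df._1[0]).alias('product'))\
    .show()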