I am trying to multiply each matrix row by its corresponding element in a given vector using pyspark 2.2.0.
For example, in numpy I can do this as follows:
import numpy as np

foo = np.array([[1, 2, 3], [4, 5, 6]])
bar = np.array([[2], [3]])
bar * foo
Results in:
array([[ 2,  4,  6],
       [12, 15, 18]])
Note that I don't want to do a dot product. It's simply multiplying every element in a matrix row by the corresponding element in a vector.
Is there some way of doing this in pyspark 2.2.0? I have tried multiple things but couldn't quite get what I wanted. I guess one could do it with a map, along the lines of the sketch below, but that somehow feels wrong.
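Roughly what I have in mind with a map (just a sketch; it assumes foo and bar from the numpy example above, and that both RDDs end up with the same partitioning, which zip requires):

foo_rdd = sc.parallelize(foo.tolist())
bar_rdd = sc.parallelize(bar.tolist())
# zip pairs the rows positionally; each row is then scaled by its factor
foo_rdd.zip(bar_rdd).map(lambda t: [x * t[1][0] for x in t[0]]).collect()
# [[2, 4, 6], [12, 15, 18]]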
Is there some better way?
You can do this by joining the two dataframes on a row index, for instance, and then using a UDF to multiply each element of the ArrayType column by the IntegerType value:
First, let's create dataframes with a row index; zipWithIndex pairs each row with its position, and toDF turns the resulting (row, index) tuples into columns _1 and _2:

foo_df = sc.parallelize(foo.tolist()).zipWithIndex().toDF()
bar_df = sc.parallelize(bar.tolist()).zipWithIndex().toDF()
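For reference, foo_df should then look something like this:

foo_df.show()

+---------+--+
|       _1|_2|
+---------+--+
|[1, 2, 3]| 0|
|[4, 5, 6]| 1|
+---------+--+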
Now to join them and get the final result:
import pyspark.sql.functions as psf
from pyspark.sql.types import ArrayType, IntegerType

# UDF multiplying every element of an array column by an integer
mul = psf.udf(lambda xx, y: [x * y for x in xx], ArrayType(IntegerType()))

foo_df.join(bar_df, '_2')\
    .select(mul(foo_df._1, bar_df._1[0]))\
    .show()
+-------------------+
|<lambda>(_1, _1[0])|
+-------------------+
| [2, 4, 6]|
| [12, 15, 18]|
+-------------------+
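The result column gets named after the lambda; if you want a cleaner column name, you can alias it (alias is standard Spark API):

foo_df.join(bar_df, '_2')\
    .select(mul(foo_df._1, bar_df._1[0]).alias('product'))\
    .show()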