
How to multiply matrix row by vector element in pyspark 2.2.0


I am trying to multiply each matrix row by its corresponding element in a given vector using pyspark 2.2.0.

For example in numpy I can do this as follows:

import numpy as np

foo = np.array([[1, 2, 3], [4, 5, 6]])
bar = np.array([[2], [3]])
bar * foo

Results in:

array([[ 2,  4,  6],
       [12, 15, 18]])

Note that I don't want to do a dot product. I simply want to multiply every element in a matrix row by the corresponding element in a vector.

Is there some way of doing this in pyspark 2.2.0? I have tried multiple things but couldn't quite get what I wanted. I guess one could do it with a map, but somehow that feels wrong.

Is there some better way?


Solution

  • You can join the two dataframes row by row, then use a UDF to multiply each element of the ArrayType column by the IntegerType value:

    First let's create dataframes with a row index:

    foo_df = sc.parallelize(foo.tolist()).zipWithIndex().toDF()
    bar_df = sc.parallelize(bar.tolist()).zipWithIndex().toDF()
    
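    (For reference, `zipWithIndex` pairs each RDD element with its position, so joining on that index aligns matrix rows with vector elements. A plain-Python sketch of the pairs these DataFrames will hold, using `enumerate` instead of Spark:)

    ```python
    # Mimic RDD.zipWithIndex without Spark: each row becomes (row, index),
    # matching the _1 (value) and _2 (index) columns of the DataFrames above.
    foo_rows = [(row, i) for i, row in enumerate([[1, 2, 3], [4, 5, 6]])]
    bar_rows = [(row, i) for i, row in enumerate([[2], [3]])]
    print(foo_rows)  # [([1, 2, 3], 0), ([4, 5, 6], 1)]
    print(bar_rows)  # [([2], 0), ([3], 1)]
    ```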

    Now to join them and get the final result:

    import pyspark.sql.functions as psf
    from pyspark.sql.types import ArrayType, IntegerType
    mul = psf.udf(lambda xx,y: [x * y for x in xx], ArrayType(IntegerType()))
    foo_df.join(bar_df, '_2')\
        .select(mul(foo_df._1, bar_df._1[0]))\
        .show()
    
        +-------------------+
        |<lambda>(_1, _1[0])|
        +-------------------+
        |          [2, 4, 6]|
        |       [12, 15, 18]|
        +-------------------+
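    The multiply step inside the UDF is plain Python, so you can sanity-check the logic outside Spark. A minimal sketch of what happens to each joined row (this reproduces the arithmetic only, not Spark's execution):

    ```python
    # Same element-wise multiply the UDF applies to each joined row.
    def mul(xx, y):
        return [x * y for x in xx]

    # (matrix row, vector element) pairs as they look after the join on the index.
    rows = [([1, 2, 3], 2), ([4, 5, 6], 3)]
    result = [mul(xx, y) for xx, y in rows]
    print(result)  # [[2, 4, 6], [12, 15, 18]]
    ```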