apache-spark, pyspark, rdd

pyspark: creating a BlockMatrix from matrices of different sizes


I am trying to build a BlockMatrix,

+---+---+---+---+
|7.0|6.0|3.0|0.0|
|3.0|2.0|5.0|1.0|
|9.0|4.0|0.0|3.0|
+---+---+---+---+

from these three sub-matrices:

+---+---+
|7.0|6.0|
|3.0|2.0|
+---+---+

+---+---+
|9.0|4.0|
+---+---+

+---+---+
|3.0|0.0|
|5.0|1.0|
|0.0|3.0|
+---+---+

Here is my code.

from pyspark.mllib.linalg import Matrices
from pyspark.mllib.linalg.distributed import BlockMatrix
blocks = sc.parallelize([(0, 0, Matrices.dense(2, 2, [7,3,6,2])),
                         (2, 0, Matrices.dense(1, 2, [9,4])),
                         (0, 2, Matrices.dense(3, 2, [3.0, 5.0, 0.0, 0.0, 1.0, 3.0]))
                        ])
blockM = BlockMatrix(blocks, 2, 2)

However, I get the error "TypeError: Cannot convert type into a sub-matrix block tuple". Any idea where I went wrong? How should I understand this BlockMatrix type? Thanks!


Solution

  • TL;DR You cannot create a BlockMatrix from such input directly.

    BlockMatrix is a regular structure - all blocks in a BlockMatrix have to be of the same maximum size. Furthermore, the total number of rows and columns has to be divisible by the number of rows and columns per block, respectively.

    However, individual matrices can be smaller than the block - in that case the data will occupy the upper left corner of the block.

    You'll have to restructure your data to match these criteria - see the sketch below.
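
    For example (a minimal sketch - the 1 x 2 block layout below is just one of several layouts that satisfy these criteria), the 3 x 4 matrix can be cut into a regular 3 x 2 grid of 1 x 2 blocks. Note also that PySpark expects each RDD element to be a ((blockRowIndex, blockColIndex), sub-matrix) pair, not the flat (i, j, sub-matrix) triple used in the question - that flat tuple shape is what raises the TypeError.

    from pyspark.mllib.linalg import Matrices
    from pyspark.mllib.linalg.distributed import BlockMatrix

    # A regular 3 x 2 grid of 1 x 2 blocks: every block has the same size,
    # and 3 rows / 4 columns are divisible by 1 / 2 respectively.
    # Each element is a ((blockRowIndex, blockColIndex), sub-matrix) pair.
    # Matrices.dense stores its values in column-major order.
    blocks = sc.parallelize([
        ((0, 0), Matrices.dense(1, 2, [7.0, 6.0])),
        ((0, 1), Matrices.dense(1, 2, [3.0, 0.0])),
        ((1, 0), Matrices.dense(1, 2, [3.0, 2.0])),
        ((1, 1), Matrices.dense(1, 2, [5.0, 1.0])),
        ((2, 0), Matrices.dense(1, 2, [9.0, 4.0])),
        ((2, 1), Matrices.dense(1, 2, [0.0, 3.0])),
    ])

    blockM = BlockMatrix(blocks, 1, 2)
    blockM.toLocalMatrix()  # reassembles the full 3 x 4 matrix as a DenseMatrix

    Calling toLocalMatrix() at the end is just a quick way to verify that the blocks reassemble into the original 3 x 4 matrix.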