hadoop matrix scalding

Transforming matrix format, scalding


In Scalding we can easily work with matrices using the Matrix API, like this:

val matrix = Tsv(path, ('row, 'col, 'val))
  .read
  .toMatrix[Long,Long,Double]('row, 'col, 'val)

But how can I convert a matrix into that (row, column, value) format from the dense format in which we usually write it? Is there an elegant way? That is, from

1 2 3
3 4 5
5 6 7

to

1 1 1
1 2 2
1 3 3
2 1 3
2 2 4
2 3 5
3 1 5
3 2 6
3 3 7

I need this to run operations on very large matrices, and I don't know the number of rows and columns in advance (is it possible to give the size in the file, NxM for example?).

I tried to do something with TextLine( args("input") ), but I don't know how to get the line number. I want to convert the matrix on Hadoop; maybe there are other ways to deal with this format? Is it possible with Scalding?


Solution

  • The answer below is not mine; it is the OP's own answer, which was originally posted in the question.


    Here's what I've done, which outputs what I wanted:

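    // Mutable state shared between calls: this relies on the tokens being
    // processed sequentially and in order by a single task.
    var prev: Long = 0
    var pos: Long = 1
    
    val zeroInt = 0
    val zeroDouble = 0.0
    
    TextLine( args("a") )
        // One tuple per whitespace-separated token; 'offset and 'line are kept.
        .flatMap('line -> 'number)  { line : String => line.split("\\s+") }
        .mapTo(('offset, 'line, 'number) -> ('row, 'col, 'v)) { 
          fields: (Long, String, String) => 
            val (offset, _, number) = fields
            // Assumes 'offset is a zero-based line number; on Hadoop it may be
            // a byte offset instead, in which case 'row is not a row index.
            // pos is the column: advance it within a row, reset it on a new row.
            pos = if(prev == (offset + 1)) pos + 1 else 1
            prev = offset + 1
            (offset + 1, pos, number) }
        // Drop zero entries (compared as the strings "0" and "0.0").
        .filter('row, 'col, 'v) { 
          entry: (Long, Long, String) => 
            val (_, _, v) = entry
            (v != zeroInt.toString) && (v != zeroDouble.toString) }
        .write(Tsv(args("c")))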
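
To check the intended output on a small input, the same reshaping can be sketched with plain Scala collections. This is only an illustration of the dense-to-(row, col, value) logic, not a Scalding job: it runs in memory on one machine, and the names `DenseToTriples` and `toTriples` are made up for this example. It parses each value as a number before dropping zeros, which is an assumption about what the string comparison above is meant to achieve.

```scala
// Plain-Scala sketch (no Hadoop or Scalding); DenseToTriples and toTriples
// are hypothetical names introduced for this illustration only.
object DenseToTriples {

  /** Turns whitespace-separated dense rows into 1-based (row, col, value)
    * triples, skipping entries whose numeric value is zero. */
  def toTriples(lines: Seq[String]): Seq[(Long, Long, Double)] =
    for {
      (line, r)  <- lines.zipWithIndex                       // r: 0-based row index
      (token, c) <- line.trim.split("\\s+").toSeq.zipWithIndex // c: 0-based column index
      if token.nonEmpty                                      // ignore blank lines
      value = token.toDouble
      if value != 0.0                                        // keep non-zero entries only
    } yield (r + 1L, c + 1L, value)

  def main(args: Array[String]): Unit = {
    val dense = Seq("1 2 3", "3 4 5", "5 6 7")
    // Print one tab-separated triple per line, like the Tsv output.
    toTriples(dense).foreach { case (r, c, v) => println(s"$r\t$c\t$v") }
  }
}
```

Here the row index comes from `zipWithIndex` on the whole input, which is exactly the piece that is hard to get in a distributed job, where each task sees only part of the file.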