scala apache-spark apache-spark-mllib

How to skip a line in a Spark RDD map based on an if condition


I have a file that I want to feed to an MLlib algorithm, so I am following the example and doing something like this:

import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile(my_file).map { line =>
    // Split the CSV row; the first column is skipped, the rest are features
    val parts = line.split(",")
    Vectors.dense(parts.drop(1).map(_.toDouble))
}

This works, except that sometimes a feature is missing: one column of some row has no data, and I want to throw such rows away.

So I want to do something like this: map { line => if (containsMissing(line)) { skipLine } else { ... /* same as before */ } }

How can I implement this skipLine step?


Solution

  • You can use the filter function to drop such lines before mapping:

    val data = sc.textFile(my_file)
       .filter(_.split(",").length == cols)  // keep only rows with the expected column count
       .map { line =>
            // your code
       }
    

    This assumes the variable cols holds the number of columns in a valid row.
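
One caveat with the length check: split(",") drops trailing empty strings but keeps empty strings in the middle, so a row like "id2,,3.0" still splits into three parts and passes the filter. A possible alternative, sketched below, is to parse each line into an Option and let flatMap discard the None results, which gives the "skip line" behaviour inside a single step. The helper name parseFeatures, the object name, and the sample rows are hypothetical and only for illustration; the sketch also assumes, as in the question, that the first column is not a feature. It runs on a plain Seq so it can be tried without a cluster.

```scala
import scala.util.Try

object SkipIncompleteRows {
  // Hypothetical helper: returns the feature values of a row, or None when the
  // row has the wrong number of columns, an empty field, or a non-numeric feature.
  def parseFeatures(line: String, cols: Int): Option[Array[Double]] = {
    // limit = -1 keeps trailing empty strings, so "id3,4.0," yields three fields
    val parts = line.split(",", -1).map(_.trim)
    if (parts.length != cols || parts.exists(_.isEmpty)) None
    else Try(parts.drop(1).map(_.toDouble)).toOption // drop(1) skips the first column
  }

  def main(args: Array[String]): Unit = {
    // Made-up sample rows: the second and third each have a missing feature
    val lines = Seq("id1,1.0,2.0", "id2,,3.0", "id3,4.0,", "id4,5.0,6.0")
    // flatMap flattens the Options, so rows that parsed to None simply disappear
    val rows = lines.flatMap(line => parseFeatures(line, cols = 3))
    rows.foreach(r => println(r.mkString(",")))
  }
}
```

On an RDD the same helper should slot in as sc.textFile(my_file).flatMap(line => SkipIncompleteRows.parseFeatures(line, cols)).map(arr => Vectors.dense(arr)), since RDD.flatMap accepts a function returning an Option. Compared with filter followed by map, this splits each line only once.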