Search code examples
apache-sparkapache-spark-sqlsparkr

How to select rows and assign them new values with SparkR?


In R programming language I can do following:

x <- c(1, 8, 3, 5, 6)
y <- rep("Down",5)
y[x>5] <- "Up"

This would result in a y vector being ("Down", "Up", "Down", "Down", "Up")

Now my x sequence is an output of the predict function on a linear model fit. The predict function in R returns a sequence while the predict function in Spark returns a DataFrame containing the columns of the test-dataset + the columns label and prediction.

By running

y[x$prediction > .5]

I get the error:

Error in y[x$prediction > 0.5] : invalid subscript type 'S4'

How would I solve this problem?


Solution

  • On selecting rows:

    Your approach will not work, since y, as a product of Spark predict, is a Spark (and not R) dataframe; you should use the filter function of SparkR. Here is a reproducible example using the iris dataset:

    library(SparkR)
    sparkR.version()
    # "2.2.1"
    
    df <- as.DataFrame(iris)
    df
    # SparkDataFrame[Sepal_Length:double, Sepal_Width:double, Petal_Length:double, Petal_Width:double, Species:string]
    nrow(df)
    # 150
    
    # Let's keep only the records with Petal_Width > 0.2:
    df2 <- filter(df, df$Petal_Width > 0.2)    
    nrow(df2)
    # 116
    

    Check also the example in the docs.

    On replacing row values:

    The standard practice for replacing row values in Spark dataframes is first to create a new column with the required condition, and then possibly dropping the old column; here is an example, where we replace values of Petal_Width greater than 0.2 with 0's in the df we have defined above:

    newDF <- withColumn(df, "new_PetalWidth", ifelse(df$Petal_Width > 0.2, 0, df$Petal_Width))
    head(newDF)
    # result:
      Sepal_Length Sepal_Width Petal_Length Petal_Width Species new_PetalWidth
    1          5.1         3.5          1.4         0.2  setosa            0.2
    2          4.9         3.0          1.4         0.2  setosa            0.2
    3          4.7         3.2          1.3         0.2  setosa            0.2
    4          4.6         3.1          1.5         0.2  setosa            0.2
    5          5.0         3.6          1.4         0.2  setosa            0.2
    6          5.4         3.9          1.7         0.4  setosa            0.0 # <- value changed
    
    # drop the old column:
    newDF <- drop(newDF, "Petal_Width")
    head(newDF)
    # result:
      Sepal_Length Sepal_Width Petal_Length Species new_PetalWidth
    1          5.1         3.5          1.4  setosa            0.2
    2          4.9         3.0          1.4  setosa            0.2
    3          4.7         3.2          1.3  setosa            0.2
    4          4.6         3.1          1.5  setosa            0.2
    5          5.0         3.6          1.4  setosa            0.2
    6          5.4         3.9          1.7  setosa            0.0
    

    The method also works along different columns; here is an example of a new column taking values 0 or Petal_Width, depending on a condition for Petal_Length:

    newDF2 <- withColumn(df, "something_here", ifelse(df$Petal_Length > 1.4, 0, df$Petal_Width))
    head(newDF2)
    # result:
      Sepal_Length Sepal_Width Petal_Length Petal_Width Species something_here
    1          5.1         3.5          1.4         0.2  setosa            0.2
    2          4.9         3.0          1.4         0.2  setosa            0.2
    3          4.7         3.2          1.3         0.2  setosa            0.2
    4          4.6         3.1          1.5         0.2  setosa            0.0
    5          5.0         3.6          1.4         0.2  setosa            0.2
    6          5.4         3.9          1.7         0.4  setosa            0.0