In R programming language I can do following:
x <- c(1, 8, 3, 5, 6)
y <- rep("Down",5)
y[x>5] <- "Up"
This would result in a y vector being ("Down", "Up", "Down", "Down", "Up")
Now my x sequence is an output of the predict
function on a linear model fit. The predict
function in R returns a sequence while the predict
function in Spark returns a DataFrame containing the columns of the test-dataset + the columns label
and prediction
.
By running
y[x$prediction > .5]
I get the error:
Error in y[x$prediction > 0.5] : invalid subscript type 'S4'
How would I solve this problem?
On selecting rows:
Your approach will not work, since y
, as a product of Spark predict
, is a Spark (and not R) dataframe; you should use the filter
function of SparkR. Here is a reproducible example using the iris
dataset:
library(SparkR)
sparkR.version()
# "2.2.1"
df <- as.DataFrame(iris)
df
# SparkDataFrame[Sepal_Length:double, Sepal_Width:double, Petal_Length:double, Petal_Width:double, Species:string]
nrow(df)
# 150
# Let's keep only the records with Petal_Width > 0.2:
df2 <- filter(df, df$Petal_Width > 0.2)
nrow(df2)
# 116
Check also the example in the docs.
On replacing row values:
The standard practice for replacing row values in Spark dataframes is first to create a new column with the required condition, and then possibly dropping the old column; here is an example, where we replace values of Petal_Width
greater than 0.2
with 0's in the df
we have defined above:
newDF <- withColumn(df, "new_PetalWidth", ifelse(df$Petal_Width > 0.2, 0, df$Petal_Width))
head(newDF)
# result:
Sepal_Length Sepal_Width Petal_Length Petal_Width Species new_PetalWidth
1 5.1 3.5 1.4 0.2 setosa 0.2
2 4.9 3.0 1.4 0.2 setosa 0.2
3 4.7 3.2 1.3 0.2 setosa 0.2
4 4.6 3.1 1.5 0.2 setosa 0.2
5 5.0 3.6 1.4 0.2 setosa 0.2
6 5.4 3.9 1.7 0.4 setosa 0.0 # <- value changed
# drop the old column:
newDF <- drop(newDF, "Petal_Width")
head(newDF)
# result:
Sepal_Length Sepal_Width Petal_Length Species new_PetalWidth
1 5.1 3.5 1.4 setosa 0.2
2 4.9 3.0 1.4 setosa 0.2
3 4.7 3.2 1.3 setosa 0.2
4 4.6 3.1 1.5 setosa 0.2
5 5.0 3.6 1.4 setosa 0.2
6 5.4 3.9 1.7 setosa 0.0
The method also works along different columns; here is an example of a new column taking values 0 or Petal_Width
, depending on a condition for Petal_Length
:
newDF2 <- withColumn(df, "something_here", ifelse(df$Petal_Length > 1.4, 0, df$Petal_Width))
head(newDF2)
# result:
Sepal_Length Sepal_Width Petal_Length Petal_Width Species something_here
1 5.1 3.5 1.4 0.2 setosa 0.2
2 4.9 3.0 1.4 0.2 setosa 0.2
3 4.7 3.2 1.3 0.2 setosa 0.2
4 4.6 3.1 1.5 0.2 setosa 0.0
5 5.0 3.6 1.4 0.2 setosa 0.2
6 5.4 3.9 1.7 0.4 setosa 0.0