apache-spark, sparkr

Seeing how many values in two SparkR columns match


I have two integer columns (x1 and x2) in a SparkR DataFrame named df that are very similar to each other. I want to get a count of how many of the values match and compare it with the total length of the columns. How can I do this? I have tried the following, both of which result in errors.

agg(df, sum(df$x1==df$x2))
collect(sum(df$x1==df$x2))

Solution

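  • Add a Boolean column that flags whether the two values match, then group by that column and count the rows in each group. The count for TRUE is the number of matches, and the two counts together give the total number of rows with a non-NULL comparison: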

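    # Flag each row with whether x1 and x2 agree
    df <- withColumn(df, 'x', df$x1==df$x2)
    # Count the rows for each flag value (TRUE = match, FALSE = mismatch)
    head(agg(groupBy(df, 'x'), x="count"))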
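
  • If only the two numbers are needed, another option is to filter on the comparison and count rows. The sketch below is a minimal example under some assumptions: a local Spark installation is available, and the small data frame is made-up sample data standing in for the asker's df. The attempts in the question likely fail because Spark's sum expects a numeric column rather than a Boolean one, and collect expects a SparkDataFrame rather than a Column; casting the comparison to an integer first is one way around the former.

```r
library(SparkR)
sparkR.session()  # assumption: a local Spark installation is available

# Hypothetical sample data standing in for the asker's df
df <- createDataFrame(data.frame(x1 = c(1L, 2L, 3L, 4L),
                                 x2 = c(1L, 2L, 0L, 4L)))

# Option 1: keep only the matching rows and count them.
# Rows with a NULL in either column do not pass the filter.
matches <- nrow(filter(df, df$x1 == df$x2))
total <- nrow(df)
matches / total  # share of rows where x1 equals x2

# Option 2: cast the Boolean comparison to an integer so it can be summed.
counts <- collect(agg(df, matches = sum(cast(df$x1 == df$x2, "integer"))))
counts$matches
```

Option 1 triggers two Spark jobs (one per nrow call), while Option 2 computes the match count in a single aggregation, which may matter on a large table.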