Search code examples
rapache-sparkdplyrsparkrsparklyr

Sparklyr / Dplyr Solution for counting occurrences of each value in all columns to one table


I would like to get a table from my data where the first column has a list of values that show up in the data and each additional column corresponds to a column in the data and gives the number of times that value shows up in that column.

Alternatively, I would also accept a table where each column is a column in the data and the values in the column are a list of all the distinct values that show up in that column (without the number of times).

In base R I believe you get a similar result from table(stack(my_data)), but this is a very large data set in spark so I am hoping for an efficient implementation.

Here is some sample data:

Column A Column B
46 18
23 54
54 18
23 46
46 46
97 33
21 46

And I want an output of either (i)

Column A Column B
46 18
23 54
54 46
97 33
21

(order of values in column does not matter)

or (ii)

Value Column A Column B
46 2 3
23 2 0
54 1 1
18 0 2
21 1 0
97 1 0
33 0 1

(again, order of values in column does not matter)

I realize that I could use distinct() or count() on each column and bind results, which is my current idea, but it seems like it would be more efficient to do multiple columns at once. Is there an efficient solution with data in spark?

One can assume I will only do this for columns of the same class/type to make things easier. The range of values in the columns isn't always the same though, for example a value can appear in just one column not all of them.

When I try using a SQL query, the problem is it wants a common column to join data on, so the first style output is complicated. Also, passing more than one column to distinct/count means it looks at those columns grouped, instead of separately.

Big challenge is there are a lot of columns in the data, so efficiency is a concern and that's why I'm hoping there is an answer for doing multiple columns at once.


Solution

  • Since you mention only operating on columns of the same type, in this example I use a selector for numeric data. where(is.numeric)) could be swapped out for specific columns, other conditions, or everything().

    library(tidyverse)
    
    dat <- data.frame(A = c(46,23,54,23,46,97,21),
                      B = c(18,54,18,46,46,33,46))
    
    dat %>%
      pivot_longer(where(is.numeric)) %>%
      count(name, value) %>%
      pivot_wider(names_from = name, values_from = n, values_fill = 0)
    #> # A tibble: 7 × 3
    #>   value     A     B
    #>   <dbl> <int> <int>
    #> 1    21     1     0
    #> 2    23     2     0
    #> 3    46     2     3
    #> 4    54     1     1
    #> 5    97     1     0
    #> 6    18     0     2
    #> 7    33     0     1
    

    Created on 2023-05-04 with reprex v2.0.2