Search code examples
rsqldf

R sqldf Count, Group By, Having > 1


I have a table with 3 columns, and a few thousand records, sample is below:

df1 <- data.frame(
  ID = c('V1', 'V1', 'V1', 'V3', 'V3', 'V3', 'V4', 'V5','V5','V5'),
  Category = c('a', 'a', 'a', 'a', 'b', 'b', 'a', 'b', 'c', 'c'),
  Amount   = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1))

Need to query with sqldf to group the data by ID and Category, then sum the amount, and only return rows when the ID has more than one category. The result would look like below:

ID  Category Amount_Sum
V3  a        1
V3  b        2
V5  b        1
V5  c        2

I tried the following code which doesnt really work, and also another row_number over partition which sort of works but is way longer than it needs to be needing many additional queries.

df2 <- sqldf::sqldf("
  SELECT   Count(*) [CNT]
           [ID], 
           [Category], 
           SUM([Amount]) [amount]
  FROM     df1
  GROUP BY [ID], 
           [Category]
  Having   Count(*) > 1")

What is the best way to get that output with sqldf in R? Thanks!


Solution

  • You could add a join which restricts to only IDs having more than one category:

    SELECT t1.ID, t1.Category, t1.Amount_Sum
    FROM
    (
        SELECT ID, Category, SUM(Amount) AS Amount_Sum
        FROM df1
        GROUP BY ID, Category
    ) t1
    INNER JOIN
    (
        SELECT ID
        FROM df1
        GROUP BY ID
        HAVING COUNT(DISTINCT Category) > 1
    ) t2
        ON t2.ID = t1.ID;