I have a table with 3 columns, and a few thousand records, sample is below:
df1 <- data.frame(
ID = c('V1', 'V1', 'V1', 'V3', 'V3', 'V3', 'V4', 'V5','V5','V5'),
Category = c('a', 'a', 'a', 'a', 'b', 'b', 'a', 'b', 'c', 'c'),
Amount = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1))
Need to query with sqldf to group the data by ID and Category, then sum the amount, and only return rows when the ID has more than one category. The result would look like below:
ID Category Amount_Sum
V3 a 1
V3 b 2
V5 b 1
V5 c 2
I tried the following code which doesnt really work, and also another row_number over partition which sort of works but is way longer than it needs to be needing many additional queries.
df2 <- sqldf::sqldf("
SELECT Count(*) [CNT]
[ID],
[Category],
SUM([Amount]) [amount]
FROM df1
GROUP BY [ID],
[Category]
Having Count(*) > 1")
What is the best way to get that output with sqldf in R? Thanks!
You could add a join which restricts to only IDs having more than one category:
SELECT t1.ID, t1.Category, t1.Amount_Sum
FROM
(
SELECT ID, Category, SUM(Amount) AS Amount_Sum
FROM df1
GROUP BY ID, Category
) t1
INNER JOIN
(
SELECT ID
FROM df1
GROUP BY ID
HAVING COUNT(DISTINCT Category) > 1
) t2
ON t2.ID = t1.ID;