I am trying to subset flights table without the use of nested queries using sqldf library in R. I have calculated median and average values using one query and stored it into different tables. In the end, I am trying to apply WHERE condition to solve but I am not able to understand why the JOIN function is behaving the way it is. I have two questions.
Can anyone explain what is going on here?
library(sqldf)
average = sqldf('SELECT AVG(distance) AS avg FROM flights')
median = sqldf('SELECT MEDIAN(distance) AS med FROM flights')
result = sqldf(
"SELECT flights.*
FROM flights
INNER JOIN average_q1 ON flights.distance
INNER JOIN median_q1 ON flights.distance
WHERE distance BETWEEN median_q1.med AND average_q1.avg"
)
1) The code in the question is not reproducible (see posting instructions at the top of the r tag page) so we will use the following:
library(sqldf)
# test inputs
DF <- data.frame(a = c(1, 1, 1, 2, 2, 2), b = 1:6)
DFsum <- sqldf("select a, sum(b) as sum from DF group by a")
sqldf("select *
from DF A
left join DFsum B using(a)")
giving:
a b sum
1 1 1 6
2 1 2 6
3 1 3 6
4 2 4 15
5 2 5 15
6 2 6 15
2) This can be done using a window function which avoids the second table:
sqldf("select *, sum(b) over (partition by a) as sum from DF")
3) The aggregation can be combined right into the SQL statement like this:
sqldf("select *
from DF A
left join (select a, sum(b) as sum
from DF
group by a) using(a)")
4) This alternative is a different way of combining the SQL statements using a with
clause, also known as a Common Table Expression (CTE)
sqldf("with B as (
select a, sum(b) as sum
from DF
group by a
)
select *
from DF A
left join B using(a)")