Search code examples
sqlrsqldf

How to perform SQL join with a table and another table having an aggregate value of one of the column of first table?


I am trying to subset flights table without the use of nested queries using sqldf library in R. I have calculated median and average values using one query and stored it into different tables. In the end, I am trying to apply WHERE condition to solve but I am not able to understand why the JOIN function is behaving the way it is. I have two questions.

  1. How is it able to perform INNER JOIN without two keys specified after ON argument?
  2. It is basically putting median and average value in all the rows after JOINS. Why?

Can anyone explain what is going on here?

library(sqldf)

average = sqldf('SELECT AVG(distance) AS avg FROM flights')

median = sqldf('SELECT MEDIAN(distance) AS med FROM flights')

result = sqldf(
    "SELECT flights.* 
     FROM flights 
     INNER JOIN average_q1 ON flights.distance 
     INNER JOIN median_q1 ON flights.distance 
     WHERE distance BETWEEN median_q1.med AND average_q1.avg"
)

Solution

  • 1) The code in the question is not reproducible (see posting instructions at the top of the tag page) so we will use the following:

    library(sqldf)
    
    # test inputs
    DF <- data.frame(a = c(1, 1, 1, 2, 2, 2), b = 1:6)
    DFsum <- sqldf("select a, sum(b) as sum from DF group by a")
    
    sqldf("select *
      from DF A
      left join DFsum B using(a)")
    

    giving:

      a b sum
    1 1 1   6
    2 1 2   6
    3 1 3   6
    4 2 4  15
    5 2 5  15
    6 2 6  15
    

    2) This can be done using a window function which avoids the second table:

    sqldf("select *, sum(b) over (partition by a) as sum from DF")
    

    3) The aggregation can be combined right into the SQL statement like this:

    sqldf("select *
      from DF A
      left join (select a, sum(b) as sum 
                 from DF 
                 group by a) using(a)")
    

    4) This alternative is a different way of combining the SQL statements using a with clause, also known as a Common Table Expression (CTE)

    sqldf("with B as (
      select a, sum(b) as sum 
        from DF 
        group by a
    )
    select *
      from DF A
      left join B using(a)")