Search code examples
sqlrjoindata-manipulation

Adding Missing Rows in SQL using JOINS


I have this dataset in R:

name = c("john", "john", "john", "sarah", "sarah", "peter", "peter", "peter", "peter")
year = c(2010, 2011, 2014, 2010, 2015, 2011, 2012, 2013, 2015)
age = c(21, 22, 25, 55, 60, 61, 62, 63, 65)
gender = c("male", "male", "male", "female", "female", "male", "male", "male", "male" )
country_of_birth = c("australia", "australia", "australia", "uk", "uk", "mexico", "mexico", "mexico", "mexico")
source = "ORIGINAL"

my_data = data.frame(name, year, age, gender, country_of_birth, source)

As we can see, some of the people in this dataset have rows with missing years (e.g. John goes from 2011 to 2014):

   name year age gender country_of_birth   source
1  john 2010  21   male        australia ORIGINAL
2  john 2011  22   male        australia ORIGINAL
3  john 2014  25   male        australia ORIGINAL
4 sarah 2010  55 female               uk ORIGINAL
5 sarah 2015  60 female               uk ORIGINAL
6 peter 2011  61   male           mexico ORIGINAL
7 peter 2012  62   male           mexico ORIGINAL
8 peter 2013  63   male           mexico ORIGINAL
9 peter 2015  65   male           mexico ORIGINAL

I have this code that is able to add these missing rows by "interpolating" logical values of the missing rows (e.g. age increases by 1, country_of_birth stays the same, etc.), and records if this row was added later or original:

library(tidyverse)
library(dplyr)

# R Code to Convert into SQL
final = my_data %>% 
    group_by(name) %>% 
    complete(year = first(year): last(year)) %>% 
    mutate(age = ifelse(is.na(age), first(age)+row_number()-1, age)) %>% 
    fill(c(gender, country_of_birth), .direction = "down") %>% 
    mutate(source = ifelse(is.na(source), "NOT ORIGINAL", source))

# A tibble: 16 x 6
# Groups:   name [3]
   name   year   age gender country_of_birth source      
   <chr> <dbl> <dbl> <chr>  <chr>            <chr>       
 1 john   2010    21 male   australia        ORIGINAL    
 2 john   2011    22 male   australia        ORIGINAL    
 3 john   2012    23 male   australia        NOT ORIGINAL

My Question: I am trying to learn how I can convert this above code into (Netezza) SQL Code.

To get an idea how to begin, I thought I could use the "dbplyr" library in R to convert my "dplyr" code into SQL:

library(dbplyr)

# attempt 1
remote_df = tbl_lazy(my_data, con = simulate_mysql())

remote_df %>% 
    group_by(name) %>% 
    complete(year = first(year): last(year)) %>% 
    mutate(age = ifelse(is.na(age), first(age)+row_number()-1, age)) %>% 
    fill(c(gender, country_of_birth), .direction = "down") %>% 
    mutate(source = ifelse(is.na(source), "MISSING", source))  %>% show_query()

# attempt 2
remote_df = tbl_lazy(my_data, con = simulate_mssql())

remote_df %>% 
    group_by(name) %>% 
    complete(year = first(year): last(year)) %>% 
    mutate(age = ifelse(is.na(age), first(age)+row_number()-1, age)) %>% 
    fill(c(gender, country_of_birth), .direction = "down") %>% 
    mutate(source = ifelse(is.na(source), "MISSING", source))  %>% show_query()

# attempt 3

 con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

 remote_df <- copy_to(con, my_data)

remote_df %>% 
    group_by(name) %>% 
    complete(year = first(year): last(year)) %>% 
    mutate(age = ifelse(is.na(age), first(age)+row_number()-1, age)) %>% 
    fill(c(gender, country_of_birth), .direction = "down") %>% 
    mutate(source = ifelse(is.na(source), "NOT ORIGINAL", source))



# attempt 4

memdb_frame(my_data) %>% 
    group_by(name) %>% 
    complete(year = first(year): last(year)) %>% 
    mutate(age = ifelse(is.na(age), first(age)+row_number()-1, age)) %>% 
    fill(c(gender, country_of_birth), .direction = "down") %>% 
    mutate(source = ifelse(is.na(source), "MISSING", source))  %>% show_query()

However all these attempts give me the same error:

Error in `fill()`:
x `.data` does not have explicit order.
i Please use `arrange()` or `window_order()` to make determinstic.
Run `rlang::last_error()` to see where the error occurred.

Can someone please show me what I am doing wrong and what I can do to convert this R code into SQL code? I had thought that perhaps I could find out which rows are missing for which person, create these rows - and then somehow use JOINS to bring them back to the original dataset in SQL.

Thanks!


Solution

  • In SQL you can create a "Cartesian product" of rows through a cross join. I believe the equivalent in R is merge() . For the years that are missing, in SQL, you would need a table or resultset, but you should be able to use a sequence in R

    Incorporating a sequence of years and the merge() function:

    library(tidyverse)
    library(dplyr)
    
    # Create a data frame with the sequence of years
    years_df <- data.frame(year = seq(2010, 2023))
    
    # Perform a cross join with the original data
    final <- merge(my_data, years_df, all = TRUE) %>% 
        group_by(name) %>% 
        complete(year = first(year): last(year)) %>% 
        mutate(age = ifelse(is.na(age), first(age)+row_number()-1, age)) %>% 
        fill(c(gender, country_of_birth), .direction = "down") %>% 
        mutate(source = ifelse(is.na(source), "NOT ORIGINAL", source))
    

    The above is untested!

    SQL Code:

    CREATE TABLE mytable (
        name VARCHAR(255),
        year INTEGER,
        age INTEGER,
        gender VARCHAR(255),
        country_of_birth VARCHAR(255),
        source VARCHAR(255)
    );
    
    INSERT INTO mytable (name, year, age, gender, country_of_birth, source) VALUES ('john', 2010, 21, 'male', 'australia', 'ORIGINAL');
    INSERT INTO mytable (name, year, age, gender, country_of_birth, source) VALUES ('john', 2011, 22, 'male', 'australia', 'ORIGINAL');
    etc.
    

    Example query with join that lists all years and joins to data when there is a match:

    WITH RECURSIVE years (year) AS (
        SELECT 2010
        UNION ALL
        SELECT year + 1
        FROM years
        WHERE year < 2023
    )
    SELECT 
        t.name, years.year, t.age, t.gender, t.country_of_birth, t.source
    FROM years
    LEFT JOIN mytable AS t ON years.year = t.year
    ;
    

    When I reconsidered this you don't need a cross join. Instead you need a left join or it can be expressed as left outer join (in SQL).

    In R, you can perform a left join using the merge function. Here’s an example:

    # Create two example data frames
    df1 <- data.frame(id = c(1, 2, 3), x = c("a", "b", "c"))
    df2 <- data.frame(id = c(2, 3, 4), y = c("d", "e", "f"))
    
    # Perform a left join
    left_join <- merge(df1, df2, by = "id", all.x = TRUE)
    
    # View the result
    left_join
    

    This creates two data frames df1 and df2 and then performs a left join on the id column using the merge function. The all.x = TRUE argument specifies that all rows from df1 should be included in the result even if there is no matching row in df2. The resulting data frame contains all rows from df1 along with any matching rows from df2.