I have this dataset in R:
name = c("john", "john", "john", "sarah", "sarah", "peter", "peter", "peter", "peter")
year = c(2010, 2011, 2014, 2010, 2015, 2011, 2012, 2013, 2015)
age = c(21, 22, 25, 55, 60, 61, 62, 63, 65)
gender = c("male", "male", "male", "female", "female", "male", "male", "male", "male" )
country_of_birth = c("australia", "australia", "australia", "uk", "uk", "mexico", "mexico", "mexico", "mexico")
source = "ORIGINAL"
my_data = data.frame(name, year, age, gender, country_of_birth, source)
As we can see, some of the people in this dataset have rows with missing years (e.g. John goes from 2011 to 2014):
name year age gender country_of_birth source
1 john 2010 21 male australia ORIGINAL
2 john 2011 22 male australia ORIGINAL
3 john 2014 25 male australia ORIGINAL
4 sarah 2010 55 female uk ORIGINAL
5 sarah 2015 60 female uk ORIGINAL
6 peter 2011 61 male mexico ORIGINAL
7 peter 2012 62 male mexico ORIGINAL
8 peter 2013 63 male mexico ORIGINAL
9 peter 2015 65 male mexico ORIGINAL
I have this code that is able to add these missing rows by "interpolating" logical values of the missing rows (e.g. age increases by 1, country_of_birth stays the same, etc.), and records if this row was added later or original:
library(tidyverse)
library(dplyr)
# R Code to Convert into SQL
final = my_data %>%
group_by(name) %>%
complete(year = first(year): last(year)) %>%
mutate(age = ifelse(is.na(age), first(age)+row_number()-1, age)) %>%
fill(c(gender, country_of_birth), .direction = "down") %>%
mutate(source = ifelse(is.na(source), "NOT ORIGINAL", source))
# A tibble: 16 x 6
# Groups: name [3]
name year age gender country_of_birth source
<chr> <dbl> <dbl> <chr> <chr> <chr>
1 john 2010 21 male australia ORIGINAL
2 john 2011 22 male australia ORIGINAL
3 john 2012 23 male australia NOT ORIGINAL
My Question: I am trying to learn how I can convert this above code into (Netezza) SQL Code.
To get an idea how to begin, I thought I could use the "dbplyr" library in R to convert my "dplyr" code into SQL:
library(dbplyr)
# attempt 1
remote_df = tbl_lazy(my_data, con = simulate_mysql())
remote_df %>%
group_by(name) %>%
complete(year = first(year): last(year)) %>%
mutate(age = ifelse(is.na(age), first(age)+row_number()-1, age)) %>%
fill(c(gender, country_of_birth), .direction = "down") %>%
mutate(source = ifelse(is.na(source), "MISSING", source)) %>% show_query()
# attempt 2
remote_df = tbl_lazy(my_data, con = simulate_mssql())
remote_df %>%
group_by(name) %>%
complete(year = first(year): last(year)) %>%
mutate(age = ifelse(is.na(age), first(age)+row_number()-1, age)) %>%
fill(c(gender, country_of_birth), .direction = "down") %>%
mutate(source = ifelse(is.na(source), "MISSING", source)) %>% show_query()
# attempt 3
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
remote_df <- copy_to(con, my_data)
remote_df %>%
group_by(name) %>%
complete(year = first(year): last(year)) %>%
mutate(age = ifelse(is.na(age), first(age)+row_number()-1, age)) %>%
fill(c(gender, country_of_birth), .direction = "down") %>%
mutate(source = ifelse(is.na(source), "NOT ORIGINAL", source))
# attempt 4
memdb_frame(my_data) %>%
group_by(name) %>%
complete(year = first(year): last(year)) %>%
mutate(age = ifelse(is.na(age), first(age)+row_number()-1, age)) %>%
fill(c(gender, country_of_birth), .direction = "down") %>%
mutate(source = ifelse(is.na(source), "MISSING", source)) %>% show_query()
However all these attempts give me the same error:
Error in `fill()`:
x `.data` does not have explicit order.
i Please use `arrange()` or `window_order()` to make determinstic.
Run `rlang::last_error()` to see where the error occurred.
Can someone please show me what I am doing wrong and what I can do to convert this R code into SQL code? I had thought that perhaps I could find out which rows are missing for which person, create these rows - and then somehow use JOINS to bring them back to the original dataset in SQL.
Thanks!
In SQL you can create a "Cartesian product" of rows through a cross join
. I believe the equivalent in R is merge() . For the years that are missing, in SQL, you would need a table or resultset, but you should be able to use a sequence in R
Incorporating a sequence of years and the merge() function:
library(tidyverse)
library(dplyr)
# Create a data frame with the sequence of years
years_df <- data.frame(year = seq(2010, 2023))
# Perform a cross join with the original data
final <- merge(my_data, years_df, all = TRUE) %>%
group_by(name) %>%
complete(year = first(year): last(year)) %>%
mutate(age = ifelse(is.na(age), first(age)+row_number()-1, age)) %>%
fill(c(gender, country_of_birth), .direction = "down") %>%
mutate(source = ifelse(is.na(source), "NOT ORIGINAL", source))
The above is untested!
SQL Code:
CREATE TABLE mytable (
name VARCHAR(255),
year INTEGER,
age INTEGER,
gender VARCHAR(255),
country_of_birth VARCHAR(255),
source VARCHAR(255)
);
INSERT INTO mytable (name, year, age, gender, country_of_birth, source) VALUES ('john', 2010, 21, 'male', 'australia', 'ORIGINAL');
INSERT INTO mytable (name, year, age, gender, country_of_birth, source) VALUES ('john', 2011, 22, 'male', 'australia', 'ORIGINAL');
etc.
Example query with join that lists all years and joins to data when there is a match:
WITH RECURSIVE years (year) AS (
SELECT 2010
UNION ALL
SELECT year + 1
FROM years
WHERE year < 2023
)
SELECT
t.name, years.year, t.age, t.gender, t.country_of_birth, t.source
FROM years
LEFT JOIN mytable AS t ON years.year = t.year
;
When I reconsidered this you don't need a cross join. Instead you need a left join
or it can be expressed as left outer join
(in SQL).
In R, you can perform a left join using the merge function. Here’s an example:
# Create two example data frames
df1 <- data.frame(id = c(1, 2, 3), x = c("a", "b", "c"))
df2 <- data.frame(id = c(2, 3, 4), y = c("d", "e", "f"))
# Perform a left join
left_join <- merge(df1, df2, by = "id", all.x = TRUE)
# View the result
left_join
This creates two data frames df1 and df2 and then performs a left join on the id column using the merge function. The all.x = TRUE
argument specifies that all rows from df1 should be included in the result even if there is no matching row in df2. The resulting data frame contains all rows from df1 along with any matching rows from df2.