Search code examples
sqlnetezza

SQL: Finding Out All Possible Combinations of Years Per ID per Group


I am working with Netezza SQL.

I have the following dataset ("my_table") about students (over the years 2010-2015), the current degree major they are enrolled in, the date than an exam was taken, and the exam result:

    student_id current_major year exam_result
1            1       Science 2010           0
2            1          Arts 2013           1
3            1          Arts 2013           0
4            2       Science 2010           1
5            2          Arts 2011           1
6            2       Science 2013           1
7            3          Arts 2010           1
8            3          Arts 2015           1
9            4          Arts 2010           0
10           4       Science 2013           1
11           5          Arts 2010           0
12           5          Arts 2011           0
13           5       Science 2012           1

My Question: I want to find out if the initial degree that a student started with affects how many years the students stayed in the university. (The period of analysis is from 2010 to 2015 - a student could join the university anywhere between 2010 and 2015.)

To answer this question, I first wanted to make a table that shows the number of students that attended the college for each combination of years:

select year_2010,
 year_2011,
year_2012,
 year_2013,
 year_2014,
 year_2015, 
count(*) 
from
(
select student_id,
max(case when (year = 2010) then 1 else 0 end) as 'year_2010',
max(case when (year = 2011) then 1 else 0 end) as 'year_2011',
max(case when (year = 2012) then 1 else 0 end) as 'year_2012',
max(case when (year = 2013) then 1 else 0 end) as 'year_2013',
max(case when (year = 2014) then 1 else 0 end) as 'year_2014',
max(case when (year = 2015) then 1 else 0 end) as 'year_2015',
from my_table
group by student_id)a
group by year_2010,
 year_2011,
 year_2012,
 year_2013,
 year_2014,
 year_2015;


   year_2010 year_2011 year_2012 year_2013 year_2014 year_2015 count_star()
1          0         1         0         1         0         0           24
2          0         0         0         0         1         1           17
3          0         0         1         1         0         0           22
4          1         1         1         0         0         0           12
5          0         1         1         1         1         0           12
6          1         0         0         0         1         0           23
7          0         0         1         0         0         0           49

My Problem: Now, I want to "group" this table using the earliest available major for each student (i.e. what was the student studying in their first recorded year at the university). For the sake of this problem, let's assume that a student can not change their major in the first year and must wait until at least the second year to do this. However, from the second year onwards, a student can switch their degree major multiple times within the same year.

I would like to answer questions such as:

  • Of the students who enrolled in the university in 2010 and initially started studying sciences - how many of these students studied 5 consecutive years?

  • Of the students who enrolled in the university in 2011 and initially started studying arts - how many of these students studied at least 2 years between 2011 and 2015?

I think that such questions can be answered by:

  • Step 1: Using a PARTITION function to find out the earliest year (via a CTE)
  • Step 2: Using the existing query
  • Step 3: Joining the results from Step 1 and Step 2 together

Here is my attempt to do this:

WITH earliest_major AS (
    SELECT student_id, current_major AS earliest_major
    FROM (
        SELECT student_id, current_major, year,
        ROW_NUMBER() OVER (PARTITION BY student_id ORDER BY year) AS rn
        FROM my_table
    ) sub
    WHERE rn = 1
)
SELECT em.earliest_major,
       year_2010,
       year_2011,
       year_2012,
       year_2013,
       year_2014,
       year_2015, 
       COUNT(*) 
FROM (
    SELECT student_id,
           MAX(CASE WHEN (year = 2010) THEN 1 ELSE 0 END) AS year_2010,
           MAX(CASE WHEN (year = 2011) THEN 1 ELSE 0 END) AS year_2011,
           MAX(CASE WHEN (year = 2012) THEN 1 ELSE 0 END) AS year_2012,
           MAX(CASE WHEN (year = 2013) THEN 1 ELSE 0 END) AS year_2013,
           MAX(CASE WHEN (year = 2014) THEN 1 ELSE 0 END) AS year_2014,
           MAX(CASE WHEN (year = 2015) THEN 1 ELSE 0 END) AS year_2015
    FROM my_table
    GROUP BY student_id
) a
JOIN earliest_major em ON a.student_id = em.student_id
GROUP BY em.earliest_major, 
         year_2010, 
         year_2011, 
         year_2012, 
         year_2013, 
         year_2014, 
         year_2015;

The query seems to run and produce results in the desired format:

  earliest_major year_2010 year_2011 year_2012 year_2013 year_2014 year_2015 count_star()
1         Science         0         1         0         1         0         0           15
2            Arts         0         0         0         0         1         1           11
3            Arts         0         0         1         1         0         0           13
4            Arts         1         1         1         0         0         0            8
5         Science         0         1         1         1         1         0            7

But I am not sure if my logic is correct - can someone please help me with this?

Thanks!


Solution

  • I think you should drop your columnar format and your cases. You effectively want a histogram. It's preferable for this to be in "long" format; the application layer can reformat that if necessary.

    create table major_exams(
      student_id int not null,
      current_major text not null,
      year smallint not null check (year between 1900 and 2200),
      exam_result boolean not null
    );
    
    insert into major_exams(student_id, current_major, year, exam_result) values
    ( 1, 'Science', 2010, false),
    ( 1,    'Arts', 2013, true),
    ( 1,    'Arts', 2013, false),
    ( 2, 'Science', 2010, true),
    ( 2,    'Arts', 2011, true),
    ( 2, 'Science', 2013, true),
    ( 3,    'Arts', 2010, true),
    ( 3,    'Arts', 2015, true),
    ( 4,    'Arts', 2010, false),
    ( 4, 'Science', 2013, true),
    ( 5,    'Arts', 2010, false),
    ( 5,    'Arts', 2011, false),
    ( 5, 'Science', 2012, true);
    
    select major, duration, count(*) as n
    from (
        select first_majors.major,
            last_majors.year - first_majors.year + 1 as duration
        from (
            select distinct on (student_id) student_id, current_major as major, year
            from major_exams
            order by student_id, year
        ) first_majors
        join (
            select student_id, max(year) as year
            from major_exams
            group by student_id
        ) last_majors on last_majors.student_id = first_majors.student_id
    ) major_bounds
    group by major, duration
    order by major, duration;
    
    major       duration    n
    Arts        3           1
    Arts        4           1
    Arts        6           1
    Science     4           2