Search code examples
sqlpostgresqlsql-order-byfull-outer-join

Merging some columns from two postgres tables into a new table based on row value


Hello PostgresSQL experts (and maybe this is also a task for Perl's DBI since I also happen to be working with it, but...) I might also have some terminologies misused here so bear with me.

I have a set of 32 tables, every one exactly as the other. The first column of every table always contains a date, while the second column contains values (integers) that can change once every 24 hours, some samples get back-dated. In many cases, the tables may not contain data for a particular date, ever. So here's an example of two such tables:

  date_list      | sum      date_list  | sum
----------------------  --------------------------
 2020-03-12     |   4      2020-03-09  |     1
 2020-03-14     |   5      2020-03-11  |     3
                |          2020-03-12  |     5
                |          2020-03-13  |     9
                |          2020-03-14  |    12

The idea is to merge the separate tables into one, sort of like a grid, but with the samples placed in the correct row in its own column and ensuring that the date column (always the first column) is not missing any dates, looking like this:

   date_list     | sum1  |  sum2    |  sum3  .... | sum32
---------------------------------------------------------
 2020-03-08     |       |          |
 2020-03-09     |       |     1    |
 2020-03-10     |       |          |   5
 2020-03-11     |       |     3    |  25 
 2020-03-12     |     4 |     5    |  35
 2020-03-13     |       |     9    |  37
 2020-03-14     |     5 |    12    |  40

And so on, so 33 columns by 2020-01-01 to date.

Now, I have tried doing a FULL OUTER JOIN and it succeeds. It's the subsequent attempts that get me trouble, creating a long, cascading table with the values in the wrong place or accidentally clobbering data. So I know this works if I use a table of one column with a date sequence and joining the first data table, just as a test of my theory using baby steps:

SELECT date_table.date_list, sums_1.sum FROM date_table FULL OUTER JOIN sums_1 ON date_table.date_list = sums_1.date_list

 2020-03-07     |     1
 2020-03-08     |      
 2020-03-09     |      
 2020-03-10     |     2
 2020-03-11     |      
 2020-03-12     |     4

Encouraged, I thought I'd get a little more ambitious with my testing, but that places some rows out of sequence to the bottom of the table and I'm not sure that I'm losing data or not, this time trying USING as an alternative:

SELECT * FROM sums_1 FULL OUTER JOIN sums_2 USING (date_list);

Result:

 fecha_sintomas |  sum  |  sum  
----------------+-------+-------
 2020-03-09     |       |     1
 2020-03-11     |       |     3
 2020-03-12     |     4 |     5
 2020-03-13     |       |     9
 2020-03-14     |     5 |    12
 2020-03-15     |     6 |    15
 2020-03-16     |     8 |    20
    :               :        :
 2020-10-29     | 10053 | 22403
 2020-10-30     | 10066 | 22407
 2020-10-31     | 10074 | 22416
 2020-11-01     | 10076 | 22432
 2020-11-02     | 10077 | 22434
 2020-03-07     |     1 |      
 2020-03-10     |     2 |      
(240 rows)

I think I'm getting close. In any case, where do I get to what I want, which is my grid of data described above? Maybe this is an iterative process that could benefit from using DBI?

Thanks,


Solution

  • You can full join like so:

    select date_list, s1.sum as sum1, s2.sum as sum2, s3.sum as sum3
    from sums_1 s1
    full join sums_2 s2 using (date_list)
    full join sums_3 s3 using (date_list)
    order by date_list;
    

    The using syntax makes unqualified column date_list unambiguous in the select and order by clauses. Then, we need to enumerate the sum columns, provided aliases for each of them.