I have daily timeseries for companies in my dataset and use PostgreSQL.
For every company all rows with NULL in column3 shall be deleted until the first NOT NULL entry in this column for this company. Then all consecutive missing values are filled in with the value of the last observable value for this company that is NOT NULL.
You can imagine the following example data:
date company column3
1 2004-01-01 A 5
2 2004-01-01 B NULL
3 2004-01-01 C NULL
4 2004-01-02 A NULL
5 2004-01-02 B 7
6 2004-01-02 C NULL
7 2004-01-03 A 6
8 2004-01-03 B 7
9 2004-01-03 C 9
10 2004-01-04 A NULL
11 2004-01-04 B NULL
12 2004-01-04 C NULL
It would be great if I manage to write a query that delivers
date company column3
1 2004-01-01 A 5
2 2004-01-02 A 5
3 2004-01-02 B 7
4 2004-01-03 A 6
5 2004-01-03 B 7
6 2004-01-03 C 9
7 2004-01-04 A 6
8 2004-01-04 B 7
9 2004-01-04 C 9
I tried:
SELECT a.date, a.company, COALESCE(a.column3, (SELECT b.column3 FROM mytable b
WHERE b.company=a.company AND b.colmun3 IS NOT NULL ORDER BY b.company=a.company
DESC LIMIT 1)) FROM mytable a;
There are two problems with the code:
I suggest two subquery levels with window functions instead of correlated subqueries:
SELECT *
FROM (
SELECT the_date, company, max(col3) OVER (PARTITION BY company, grp) AS col3
FROM (
SELECT *, count(col3) OVER (PARTITION BY company ORDER BY the_date) AS grp
FROM tbl
) sub1
) sub2
WHERE col3 IS NOT NULL
ORDER BY the_date, company;
Produces the requested result.
This assumes unique entries per (company, the_date)
. Should be much faster for tables with more than just a few rows. A (unique to enforce uniqueness?!) index helps performance a lot:
CREATE INDEX tbl_company_date_idx ON tbl (company, the_date);
The aggregate function count()
ignores null values when counting. Used as aggregate-window function, it computes the running count of a the column. This results in the count being "stuck" for rows with null values, thereby forming a peer group that shares the same (not-null) value.
In the second window function, the only not-null value per group is easily extracted with max()
(among other window functions). Rows before the first not-null value retain null, which is easily eliminated in the final SELECT
.
See: