Search code examples
sparse-matrixparquetamazon-athena

athena query to collapse sparse matrix / columnar data


Suppose I have the following table table1 in Amazon Athena (technically, in Glue):

torg, foo, bar, x1, x2, x3, baz
-------------------------------
t1, 1, NULL,   NULL, NULL, NULL, goober
t2, 1, NULL,   NULL, NULL, NULL, NULL  -- want to ignore this row entirely, even though foo=1
t3, 1, tronic, NULL, NULL, NULL, NULL
...
t4, 2, NULL,   NULL, NULL, NULL, horse
t4, 2, zebra,  NULL, NULL, NULL, NULL
...

Is there an Athena query that can group by foo (in this example foo=1), but ignoring NULL column values for each row, such that a result would look like?

foo, bar, baz
----------------
1, tronic, goober

For context, I am writing the as partitioned parquet to S3 at different times. It's wonderful using Athena to query this data, but I'm having a hard time envisioning if Athena is the right place to "collapse" this columnar data, into more "row-wise" data grouped by a particular column (or set of columns).

While I will likely further work with this data in pandas, and could "collapse" it there, I was hoping there might be a query pattern in Athena. Or perhaps I'm thinking about this entirely wrong...

Any insight greatly appreciated.


Solution

  • If you have only one value per group then you can use max (or min) and group by:

    WITH dataset(foo,bar, baz) AS (
       VALUES 
        (1,NULL,'tronic'),
        (1,'goober',NULL),
        (1,NULL, NULL)
     ) 
     
    SELECT foo, max(bar) bar, max(baz) baz
    FROM dataset
    GROUP BY foo
    

    Output:

    foo bar baz
    1 goober tronic