Search code examples
sqlcorrelationprestoamazon-athenapearson-correlation

How to compare a column against every other column in a SQL query with a wide table?


I'm trying to implement a query in AWS Athena where I compute the Pearson Correlation coefficient of a selected column against every other column in a matrix, and return a sorted list.

The data is stored in S3 as a CSV file and looks like this:

col_1 col_2 col_3 col_4 .... col_15000
  0    124    56   200  ....   4
  0     0     44    0   ....   0
....

I want to be able to calculate corr(col_1, col_2), corr(col_1, col_3), ...cor(col_1, col_15000) and sort the results by the correlation coefficients (find the most/least correlated columns).

What is the most efficient way to write this query? I would like this query to be as performant as possible. The obvious answer is to generate a query like this:

SELECT
corr(col_1, col_2) AS cor_2,
corr(col_1, col_3) AS cor_3,
...

corr(col_1, col_15000) AS cor_15000

However, this seems tedious, and the query size limit is quickly reached. Is there a better way that doesn't sacrifice (or increases) performance? This can easily be parallelized, as each individual correlation coefficient can be computed independently.


Solution

  • I know this is not the answer you probably are looking for but this is does not seem like something that should be done via Athena/SQL/Presto. The needing thousands of custom columns is a big red flag.

    This sounds more like a job for a Spark Job which could be run in AWS Glue via an ETL Job.

    Since your data is already in Athena, it should already be cataloged in Glue and you can use a GlueContext in spark to load your data frame directly from that datasource.

    Spark jobs can be done in Python (via pyspark) or Scala. Creating these coefficient columns via a code loop and then writing them out to another file shouldn't be a very complicated script.

    Assuming you are unfamiliar with most of this, it may be good to go through this example/tutorial: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-samples-legislators.html