sql apache-spark pyspark analytics multilabel-classification

In SQL how do I group by every one of a long list of columns and get counts, assembled all into one table?

I have performed a stratified sample on a multi-label dataset before training a classifier and want to check how balanced it is now. The columns in the dataset are:

|_Body|label_0|label_1|label_10|label_100|label_101|label_102|label_103|label_104|label_11|label_12|label_13|label_14|label_15|label_16|label_17|label_18|label_19|label_2|label_20|label_21|label_22|label_23|label_24|label_25|label_26|label_27|label_28|label_29|label_3|label_30|label_31|label_32|label_33|label_34|label_35|label_36|label_37|label_38|label_39|label_4|label_40|label_41|label_42|label_43|label_44|label_45|label_46|label_47|label_48|label_49|label_5|label_50|label_51|label_52|label_53|label_54|label_55|label_56|label_57|label_58|label_59|label_6|label_60|label_61|label_62|label_63|label_64|label_65|label_66|label_67|label_68|label_69|label_7|label_70|label_71|label_72|label_73|label_74|label_75|label_76|label_77|label_78|label_79|label_8|label_80|label_81|label_82|label_83|label_84|label_85|label_86|label_87|label_88|label_89|label_9|label_90|label_91|label_92|label_93|label_94|label_95|label_96|label_97|label_98|label_99|

I want to group by every label_* column once, and create a dictionary of the results with positive/negative counts. At the moment I am accomplishing this in PySpark SQL like this:

# Evaluate how skewed the sample is after balancing it by resampling
stratified_sample = spark.read.json('s3://stackoverflow-events/1901/Sample.Stratified.{}.*.jsonl'.format(limit))
stratified_sample.registerTempTable('stratified_sample')

label_counts = {}
for i in range(0, 100):
  count_df = spark.sql('SELECT label_{}, COUNT(*) as total FROM stratified_sample GROUP BY label_{}'.format(i, i))
  rows = count_df.rdd.take(2)
  neg_count = getattr(rows[0], 'total')
  pos_count = getattr(rows[1], 'total')
  label_counts[i] = [neg_count, pos_count]

The output is thus:

{0: [1034673, 14491],
 1: [1023250, 25914],
 2: [1030462, 18702],
 3: [1035645, 13519],
 4: [1037445, 11719],
 5: [1010664, 38500],
 6: [1031699, 17465],
 ...}

This feels like it should be possible in one SQL statement, but I can't figure out how to do this or find an existing solution. Obviously I don't want to write out all the column names and generating SQL seems worse than this solution.

Can SQL do this? Thanks!

Solution

You can generate sql without group by.

Something like

SELECT COUNT(*) AS total, SUM(label_k) as positive_k ,.. FROM table

And then use the result to produce your dict {k : [total-positive_k, positive_k]}