I am trying to make this loop work, where I compare the value of an approx_count_distinct to a threshold. I would like to execute the if statement when the distinct count is < 2, but it always returns "NULL", even though printing approx shows the right results (values smaller than 2). What am I doing wrong?
for col in s:
    approx = df.agg(approx_count_distinct(col).alias("count"))
    if approx.collect()[0] < 2:
        print(col)
I ended up doing it this way:
for col in s:
    approx = df.agg(approx_count_distinct(col).alias("count"))
    if (approx.select(F.col("count")).rdd.flatMap(lambda x: x).collect()[0]) < 2:
        print(col)