Tags: loops, if-statement, pyspark, comparison

Pyspark: compare values and if true execute statement


I am trying to make this loop work, where I compare the result of approx_count_distinct to a threshold. I would like the body of the if statement to run when the distinct count is less than 2, but it always returns "NULL", even though printing approx shows the right results (values smaller than 2). What am I doing wrong?

for col in s:
    approx = df.agg(approx_count_distinct(col).alias("count"))
    if approx.collect()[0] < 2:
        print(col)

Solution

  • I ended up doing it this way:

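    from pyspark.sql import functions as F
    from pyspark.sql.functions import approx_count_distinct

    for col in s:
        approx = df.agg(approx_count_distinct(col).alias("count"))
        # Flatten the one-row result into a plain list and compare its only value
        if (approx.select(F.col("count")).rdd.flatMap(lambda x: x).collect()[0]) < 2:
            print(col)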