python pandas dataframe pyspark bigdata

PySpark - How to perform operations on specific columns?


I'm trying to apply the round function to the df.summary() DataFrame, excluding the summary column. So far I've tried using select() with a list comprehension, e.g.

Code

df2 = df.select(*[round(column, 2).alias(column) for column in df.columns])

Output

This is the output of df2; the categorical values get converted to NULL.

+---------+-------+-------+-------+-------+
| Summary | col 1 | col 2 | col 3 | col 4 |
+---------+-------+-------+-------+-------+
| NULL    | 0     | 0.1   | 0.2   | 0.3   |
+---------+-------+-------+-------+-------+
| NULL    | 1     | 1.1   | 1.2   | 1.3   |
+---------+-------+-------+-------+-------+
| NULL    | 2     | 2.1   | 2.2   | 2.3   |
+---------+-------+-------+-------+-------+

Desired Output

I want only columns[1:] to be rounded.

+---------+-------+-------+-------+-------+
| Summary | col 1 | col 2 | col 3 | col 4 |
+---------+-------+-------+-------+-------+
| min     | 0     | 0.1   | 0.2   | 0.3   |
+---------+-------+-------+-------+-------+
| max     | 1     | 1.1   | 1.2   | 1.3   |
+---------+-------+-------+-------+-------+
| stddev  | 2     | 2.1   | 2.2   | 2.3   |
+---------+-------+-------+-------+-------+

I've also tried slicing df.columns[1:], but then it doesn't select the summary column.

df2 = df.select(*[round(column, 2).alias(column) for column in df.columns[1:])

Output

+-------+-------+-------+-------+
| col 4 | col 1 | col 2 | col 3 |
+-------+-------+-------+-------+
| 0.3   | 0     | 0.1   | 0.2   |
+-------+-------+-------+-------+
| 1.3   | 1     | 1.1   | 1.2   |
+-------+-------+-------+-------+
| 2.3   | 2     | 2.1   | 2.2   |
+-------+-------+-------+-------+

Solution

  • To leave the first column out of the rounding, pass it through unchanged and apply round only to the remaining columns. The Summary column holds strings such as min and max, which is most likely why rounding it yields NULL: they cannot be converted to numbers. You may try the following:

    columns_to_round = df.columns[1:]
    # Backticks are needed because names such as "col 1" contain spaces.
    rounded_df = df.selectExpr("Summary", *[f"round(`{column}`, 2) as `{column}`" for column in columns_to_round])
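    If you need this in more than one place, the expression list can be built by a small helper. This is only a sketch: round_exprs, keep, and digits are hypothetical names introduced here, and it assumes the column names contain no backticks. It runs without Spark because it only produces the SQL strings that selectExpr accepts.

```python
def round_exprs(columns, keep=("summary",), digits=2):
    """Build selectExpr strings: pass `keep` columns through, round the rest.

    Hypothetical helper, not part of PySpark. Assumes names contain no backticks.
    """
    kept = {name.lower() for name in keep}  # match names case-insensitively
    exprs = []
    for name in columns:
        quoted = f"`{name}`"  # backticks allow names with spaces, e.g. "col 1"
        if name.lower() in kept:
            exprs.append(quoted)  # leave this column untouched
        else:
            exprs.append(f"round({quoted}, {digits}) as {quoted}")
    return exprs


# Example with the column names from the question.
exprs = round_exprs(["Summary", "col 1", "col 2"])
for expr in exprs:
    print(expr)

# With a live DataFrame this would be used as:
#     rounded_df = df.selectExpr(*exprs)
```

    Passing df.columns as the first argument keeps the original column order, so the Summary column stays first in the result.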