Tags: apache-spark, pyspark, apache-spark-sql, rename, drop

How can I rename a column based on a cell value in Pyspark?


Currently I have this situation:

   signal_name  timestamp   signal_value
0  alert        1632733513  on
1  alert        1632733515  off
2  alert        1632733518  on

I want to rename the column signal_value with the signal_name. The df was filtered by the signal name alert, so there is no other value in signal_name.

   signal_name  timestamp   alert
0  alert        1632733513  on
1  alert        1632733515  off
2  alert        1632733518  on

Since the signal name now appears in the column header, the first column is no longer needed, so I would like to drop it.

   timestamp    alert
0  1632733513   on
1  1632733515   off
2  1632733518   on

Since there are multiple dataframes (each filtered on a different signal_name) with this problem, the approach should be generic.


Solution

  • If you control the part where the dataframe is filtered on the signal_name, then you can rename the column with the same value used in the filter (a generic sketch of this is shown after the example below).

    Otherwise, you can select the first value of the signal_name column into a Python variable, then use it to rename the column signal_value:

    data = [("alert", "1632733513", "on"), ("alert", "1632733515", "off"), ("alert", "1632733518", "on")]
    df = spark.createDataFrame(data, ["signal_name", "timestamp", "signal_value"])
    
    # Take the signal name from the first row (all rows hold the same value)
    signal_name = df.select("signal_name").first().signal_name
    
    # Rename signal_value to that name and drop the now-redundant column
    df1 = df.withColumnRenamed("signal_value", signal_name).drop("signal_name")
    
    df1.show()
    
    # +----------+-----+
    # | timestamp|alert|
    # +----------+-----+
    # |1632733513|   on|
    # |1632733515|  off|
    # |1632733518|   on|
    # +----------+-----+
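
    Since the same steps are needed for several dataframes, they can be wrapped in a small helper that filters on a given signal name and then reuses that name for the rename. This is only a minimal sketch, assuming the same spark session and df as above; the helper name extract_signal is illustrative, not part of any API:

    from pyspark.sql import DataFrame

    def extract_signal(df: DataFrame, name: str) -> DataFrame:
        # Keep only the rows for this signal, rename signal_value to the
        # signal's name, and drop the now-redundant signal_name column
        return (
            df.filter(df.signal_name == name)
              .withColumnRenamed("signal_value", name)
              .drop("signal_name")
        )

    alert_df = extract_signal(df, "alert")   # columns: timestamp, alert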