python · apache-spark · pyspark

From a single-row dataframe, how to create a new dataframe containing a list of column names and their values


Given df1:

|c1|c2|c3|c4|c5|c6|c7|c8|
|---|---|---|---|---|---|---|---|
|45|15|100|68|96|86|35|48|

How to create df2:

|columnName|ColumnValues|
|---|---|
|c1|45|
|c2|15|
|c3|100|
|c4|68|
|c5|96|
|c6|86|
|c7|35|
|c8|48|

Question: Using pyspark, if we are given dataframe df1 (shown above), how can we create a dataframe df2 that contains the column names of df1 in its first column and the values of df1 in its second column?

REMARKS: Please note that df1 will be dynamic; it will change based on the data loaded into it. As shown below, I already know how to do it if df1 is static:

data = [['c1', 45], ['c2', 15], ['c3', 100]]
mycolumns = ["myCol1","myCol2"]
df = spark.createDataFrame(data, mycolumns)
df.show()

For a static df1, the above code will show df2 as:

|myCol1|myCol2|
|---|---|
|c1|45|
|c2|15|
|c3|100|

Solution

  • Try this:

    # stack(n, 'name1', val1, 'name2', val2, ...) unpivots the single row
    # into n (name, value) rows.
    data = [["java", "dbms", "python"]]
    columns = ["Subject 1", "Subject 2", "Subject 3"]

    df1 = spark.createDataFrame(data, columns)

    column_names = df1.columns

    # Build the stack() expression dynamically from whatever columns df1 has,
    # so the same code works for any schema.
    pairs = ", ".join(f"'{col}', `{col}`" for col in column_names)
    df2 = df1.selectExpr(f"stack({len(column_names)}, {pairs}) as (columnName, columnValue)")
    df2.show()
    
    +----------+----------------+
    |columnName|     columnValue|
    +----------+----------------+
    | Subject 1|            java|
    | Subject 2|            dbms|
    | Subject 3|          python|
    +----------+----------------+