Given df1:
c1 | c2 | c3 | c4 | c5 | c6 | c7 | c8 |
---|---|---|---|---|---|---|---|
45 | 15 | 100 | 68 | 96 | 86 | 35 | 48 |
How to create df2:
columnName | ColumnValues |
---|---|
c1 | 45 |
c2 | 15 |
c3 | 100 |
c4 | 68 |
c5 | 96 |
c6 | 86 |
c7 | 35 |
c8 | 48 |
Question: Using pyspark
, if we are given dataframe df1
(shown above), how can we create a dataframe df2
that contains the column names of df1
in the first column and the values of df1
in the second second column?
REMARKS: Please note that df1
will be dynamic, it will change based on the data loaded to it. As shown below, I already know how to do it if df1
is static:
data = [['c1', 45], ['c2', 15], ['c3', 100]]
mycolumns = ["myCol1","myCol2"]
df = spark.createDataFrame(data, mycolumns)
df.show()
For a static df1, the above code will show df2 as:
|myCol1|myCol2|
|---|---|
|c1|45|
|c2|15|
|c3|100|
Try this -
from pyspark.sql.functions import *
data = [["java", "dbms", "python"]]
columns = ["Subject 1", "Subject 2", "Subject 3"]
df1 = spark.createDataFrame(data, columns)
column_names = df1.columns
stacked = df1.selectExpr("stack(" + str(len(column_names)) + "," + ",".join([f"'{col}', `{col}`" for col in column_names]) + ") as (columnName, columnValue)")
df2 = stacked.selectExpr("columnName", "columnValue")
df2.show()
+----------+----------------+
|columnName| columnValue|
+----------+----------------+
| Subject 1| java|
| Subject 2| dbms|
| Subject 3| python|
+----------+----------------+