I have the below PySpark DataFrame:

Country | State | City  | Order_1d | Order_2d | Count_1d | Count_2d
USA     | NY    | NYC   | 3        | 5        | 12       | 16
USA     | NY    | Queen | 1        | 3        | 4        | 12
USA     | CA    | SF    | 23       | 24       | 11       | 12

I would like to use PySpark to convert it to the format below: add one more column indicating the date range, and turn the per-city column data (the _1d/_2d columns) into rows tagged with that date range:

Country | State | City  | Order | Count | Date_Range
USA     | NY    | NYC   | 3     | 12    | 1d
USA     | NY    | Queen | 1     | 4     | 1d
USA     | CA    | SF    | 23    | 11    | 1d
USA     | NY    | NYC   | 5     | 16    | 2d
USA     | NY    | Queen | 3     | 12    | 2d
USA     | CA    | SF    | 24    | 12    | 2d

What is the best way to achieve this in PySpark?
Please try the code below:
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

# Create (or reuse) a SparkSession
spark = SparkSession.builder.getOrCreate()
# Define the data
data = [
("USA", "NY", "NYC", 3, 5, 12, 16),
("USA", "NY", "Queen", 1, 3, 4, 12),
("USA", "CA", "SF", 23, 24, 11, 12)
]
# Create an initial DataFrame
df = spark.createDataFrame(data, ["Country", "State", "City", "Order_1d", "Order_2d", "Count_1d", "Count_2d"])
# Create two separate DataFrames for 1d and 2d data
df1 = df.select("Country", "State", "City", "Order_1d", "Count_1d")
df2 = df.select("Country", "State", "City", "Order_2d", "Count_2d")
# Add a 'Date_Range' column to each DataFrame
df1 = df1.withColumn("Date_Range", lit("1d"))
df2 = df2.withColumn("Date_Range", lit("2d"))
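# At this point df1 would look like this (a sketch; Date_Range added,
# the Order_1d/Count_1d columns not yet renamed):
# +-------+-----+-----+--------+--------+----------+
# |Country|State| City|Order_1d|Count_1d|Date_Range|
# +-------+-----+-----+--------+--------+----------+
# |    USA|   NY|  NYC|       3|      12|        1d|
# |    USA|   NY|Queen|       1|       4|        1d|
# |    USA|   CA|   SF|      23|      11|        1d|
# +-------+-----+-----+--------+--------+----------+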
# Rename columns to match the expected output
df1 = df1.withColumnRenamed("Order_1d", "Order").withColumnRenamed("Count_1d", "Count")
df2 = df2.withColumnRenamed("Order_2d", "Order").withColumnRenamed("Count_2d", "Count")
# Combine the two DataFrames
final_df = df1.union(df2)
# Show the final DataFrame
final_df.show()
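Running this produces one row per city per date range (a sketch of the expected output; row order after a union is not guaranteed):

+-------+-----+-----+-----+-----+----------+
|Country|State| City|Order|Count|Date_Range|
+-------+-----+-----+-----+-----+----------+
|    USA|   NY|  NYC|    3|   12|        1d|
|    USA|   NY|Queen|    1|    4|        1d|
|    USA|   CA|   SF|   23|   11|        1d|
|    USA|   NY|  NYC|    5|   16|        2d|
|    USA|   NY|Queen|    3|   12|        2d|
|    USA|   CA|   SF|   24|   12|        2d|
+-------+-----+-----+-----+-----+----------+

Note that union matches columns by position, not by name; since both DataFrames were built with the same column order this is fine, but unionByName is the safer choice if the order could differ.

Alternatively, the same unpivot can be done in a single select with the SQL stack function, avoiding the intermediate DataFrames and the union entirely. A minimal sketch against the same df defined above:

from pyspark.sql.functions import expr

# stack(2, ...) turns each input row into two output rows:
# one built from the *_1d columns and one from the *_2d columns.
final_df = df.select(
    "Country", "State", "City",
    expr("stack(2, Order_1d, Count_1d, '1d', Order_2d, Count_2d, '2d') "
         "as (`Order`, Count, Date_Range)")  # backticks: Order is a SQL keyword
)
final_df.show()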