Search code examples
dataframepyspark

PySpark Column Data Turn to Row Data


I have below PySpark Dataframe:

enter image description here

I would like to using PySpark Datafream to convert to:

Adding one more column to indicate their date range and all city level data turn from column data to row data and adding their data range.

enter image description here

What is the best way achieve in PySpark?


Solution

  • Please try below code -

    from pyspark.sql.functions import col, lit
    
    # Define the data
    data = [
        ("USA", "NY", "NYC", 3, 5, 12, 16),
        ("USA", "NY", "Queen", 1, 3, 4, 12),
        ("USA", "CA", "SF", 23, 24, 11, 12)
    ]
    
    # Create an initial DataFrame
    df = spark.createDataFrame(data, ["Country", "State", "City", "Order_1d", "Order_2d", "Count_1d", "Count_2d"])
    
    # Create two separate DataFrames for 1d and 2d data
    df1 = df.select("Country", "State", "City", "Order_1d", "Count_1d")
    df2= df.select("Country", "State", "City", "Order_2d", "Count_2d")
    
    # Add a 'Date_Range' column to each DataFrame
    df1 = df1.withColumn("Date_Range", lit("1d"))
    df2 = df2.withColumn("Date_Range", lit("2d"))
    
    # Rename columns to match the expected output
    df1 = df1.withColumnRenamed("Order_1d", "Order").withColumnRenamed("Count_1d", "Count")
    df2 = df2.withColumnRenamed("Order_2d", "Order").withColumnRenamed("Count_2d", "Count")
    
    # Combine the two DataFrames
    final_df = df1.union(df2)
    
    # Show the final DataFrame
    final_df.show()