pyspark, apache-spark-sql

Insert selective columns into a PySpark dataframe


I have a dataframe with the following schema:

col1 as col1, col2, col3, ..., coln

I have another dataframe with the following schema:

col1 as diffName, coln+1, coln+2, ...

How can I append the values in col1 of the second dataframe to the first one, with the rest of the columns filled with null values?

I have tried using union as well as merge, but to no avail.


Solution

  • You could manually make the columns of both dataframes match, then union them. For example, with these dataframes:

    df1 = spark.createDataFrame([
        (1, 2),
        (3, 4)
    ], ['a', 'b'])
    
    df2 = spark.createDataFrame([
        (5, 6, 7),
        (8, 9, 10)
    ], ['a2', 'c', 'd'])
    
    The first step would be to make sure the column that is common to both dataframes has the same name:

    df2 = df2.withColumnRenamed('a2', 'a')
    

    Then you can make the column sets match by adding each missing column as nulls:

    from pyspark.sql.functions import lit
    
    # Add df2's extra columns to df1, filled with nulls
    for c in df2.columns:
        if c not in df1.columns:
            df1 = df1.withColumn(c, lit(None))
    
    # Add df1's extra columns to df2, filled with nulls
    for c in df1.columns:
        if c not in df2.columns:
            df2 = df2.withColumn(c, lit(None))
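
    Note that lit(None) produces a NullType column; Spark will generally reconcile it with the matching column's type when you union, but if you hit a type mismatch you can cast the null columns explicitly. A small sketch of that variant, assuming you want each new column in df1 to take its type from df2:

    # Look up df2's column types by name (df2.dtypes returns a list of
    # (column, type-string) tuples) and cast the null literals to match
    df2_types = dict(df2.dtypes)
    for c in df2.columns:
        if c not in df1.columns:
            df1 = df1.withColumn(c, lit(None).cast(df2_types[c]))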
    

    Finally, you can take the union. I find unionByName to be safer than the positional union, since it matches columns by name rather than by position:

    df_all = df1.unionByName(df2)
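
    With the example data above, the combined dataframe comes out like this (exact null rendering can differ slightly between Spark versions):

    df_all.show()
    # +---+----+----+----+
    # |  a|   b|   c|   d|
    # +---+----+----+----+
    # |  1|   2|null|null|
    # |  3|   4|null|null|
    # |  5|null|   6|   7|
    # |  8|null|   9|  10|
    # +---+----+----+----+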
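
    If you are on Spark 3.1 or later, unionByName can fill in the missing columns for you via its allowMissingColumns flag, which makes the two loops above unnecessary (the rename step is still needed):

    # Spark 3.1+: columns missing on either side are added as nulls
    df_all = df1.unionByName(df2, allowMissingColumns=True)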