Tags: join, pyspark, view, apache-spark-sql, drop

Why does dropping columns after joining two temp views in PySpark not work, when the same drop works with DataFrame joins?


I need to create temp views out of DataFrames, join those temp views, and then drop the duplicate columns. So I've written the code below:

from pyspark.sql.types import (StructType, StructField,
                               StringType, IntegerType, DoubleType)

Data1 = [  ("xx1",34,60),
           ("xx2",33,80),
           ("xx3",37,50) ]

dataSchema1 = StructType([
    StructField("Name",StringType(),True),
    StructField("Age",IntegerType(),True),
    StructField("Id",IntegerType(),True)
   ])

Data2 = [  (60,"M",3000.60),
           (80,"F",3300.80),
           (50,"M",5000.50) ]

dataSchema2 = StructType([
    StructField("Id",IntegerType(),True),
    StructField("Gender", StringType(), True),
    StructField("Salary", DoubleType(), True)
  ])

df1 = spark.createDataFrame(spark.sparkContext.parallelize(Data1),schema=dataSchema1)
df1.createOrReplaceTempView('view1')

df2 = spark.createDataFrame(spark.sparkContext.parallelize(Data2),schema=dataSchema2)
df2.createOrReplaceTempView('view2')

jDF=spark.sql("select * from view1 join view2 on view1.Id = view2.Id")
jDF.columns                  # ['Name', 'Age', 'Id', 'Id', 'Gender', 'Salary']
rjDF = jDF.drop('view2.Id')  # this call silently does nothing
rjDF.columns                 # ['Name', 'Age', 'Id', 'Id', 'Gender', 'Salary']

In the code above, the drop call does not work as expected, and it doesn't throw an error either. However, if I drop the column using the DataFrame API (which is obviously not the preferable choice in my use case), drop works fine:

joinDF = df1.join(df2, df1.Id == df2.Id)
dropped = joinDF.drop(df2.Id)  # works as expected
dropped.columns                # ['Name', 'Age', 'Id', 'Gender', 'Salary']

Can anyone help me understand what's wrong with the first approach of dropping a column from joined temp views?


Solution

  • `drop` on a Dataset takes column names or `Column` objects. It is a no-op if the schema doesn't contain the given column name(s).

    Internally, `drop` uses `analyzer.resolver` to check whether the provided string names a column of the DataFrame. Note that you can't use SQL-style qualified references in `drop`; if you do, Spark treats the whole string (e.g. `view2.Id`) as a single column name, finds no match, and silently does nothing.

    `selectExpr("..")` and `functions.expr("..")`, by contrast, use `sessionState.sqlParser` internally to parse SQL-style references of the form `<table/view name>.<column_name>`.

    If you want to keep the SQL-style syntax, wrap the reference in the `expr` function:

    from pyspark.sql import functions as F
    rjDF = jDF.drop(F.expr('view2.Id'))


    Otherwise, you can go with your working column-based drop:

    dropped=joinDF.drop(df2.Id)