scala, apache-spark

DataFrame still has the column after drop


I am trying to drop a column with the drop function, but the column is still there afterwards. The following code reproduces the problem:

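// Assumes a SparkSession named spark is in scope (as in spark-shell).
import org.apache.spark.sql.functions.col
import spark.implicits._

val students =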
  Seq(
    (1, 100, "Steve"),
    (2, 101, "Peter"),
    (3, 101, "Debby"),
    (4, 102, "Michael")
  ).toDF("student_id", "dept_id", "student_first_name")

val validDepartments =
  Seq(
    (100),
    (102),
  ).toDF("dept_id")

val validStudents = 
  students.as("s")
    .join(validDepartments.as("d"), col("s.dept_id") === col("d.dept_id"))
    .drop("d.dept_id")
validStudents.show(false)

This outputs:

+----------+-------+------------------+-------+
|student_id|dept_id|student_first_name|dept_id|
+----------+-------+------------------+-------+
|1         |100    |Steve             |100    |
|4         |102    |Michael           |102    |
+----------+-------+------------------+-------+

I wasn't expecting to see the last "dept_id" (because of .drop("d.dept_id")). What am I missing?

Is it a bug as stated here?


Solution

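  • When drop is given a string, it looks for a column with exactly that name. No column is named "d.dept_id" (the alias prefix is not part of the column name), so the call silently does nothing.

    The simplest fix is the join overload that takes the join column names; it keeps a single dept_id column in the result: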

    val students = Seq(
        (1, 100, "Steve"),
        (2, 101, "Peter"),
        (3, 101, "Debby"),
        (4, 102, "Michael")
      ).toDF("student_id", "dept_id", "student_first_name")
    
    val validDepartments =
      Seq(
        (100),
        (102),
      ).toDF("dept_id")
    
    // Joining on column names (a USING-style equi-join) emits dept_id only once.
    val validStudents =
      students
        .join(validDepartments, Seq("dept_id"))
    validStudents.show(false)
    

    Or, if you want to keep your explicit join condition, pass drop a Column instead of a string:

    val students = Seq(
        (1, 100, "Steve"),
        (2, 101, "Peter"),
        (3, 101, "Debby"),
        (4, 102, "Michael")
      ).toDF("student_id", "dept_id", "student_first_name")
    
    val validDepartments =
      Seq(
        (100),
        (102),
      ).toDF("dept_id")
    
    // Passing a Column (not a String) lets drop identify validDepartments' dept_id.
    val validStudents =
      students.as("s")
        .join(validDepartments.as("d"), col("s.dept_id") === col("d.dept_id"))
        .drop(validDepartments("dept_id"))
    validStudents.show(false)