I am trying to drop a column with the drop function, but the column is still there afterwards. The following code shows the problem:
val students =
  Seq(
    (1, 100, "Steve"),
    (2, 101, "Peter"),
    (3, 101, "Debby"),
    (4, 102, "Michael")
  ).toDF("student_id", "dept_id", "student_first_name")

val validDepartments =
  Seq(
    (100),
    (102),
  ).toDF("dept_id")

val validStudents =
  students.as("s")
    .join(validDepartments.as("d"), col("s.dept_id") === col("d.dept_id"))
    .drop("d.dept_id")

validStudents.show(false)
This outputs:
+----------+-------+------------------+-------+
|student_id|dept_id|student_first_name|dept_id|
+----------+-------+------------------+-------+
|1         |100    |Steve             |100    |
|4         |102    |Michael           |102    |
+----------+-------+------------------+-------+
I wasn't expecting to see the last "dept_id" column (because of .drop("d.dept_id")). What am I missing?
Is it a bug as stated here?
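It is better to use a different signature of join, the one that takes the names of the join columns. With it, the result contains dept_id only once, so there is nothing left to drop. The following code works: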
val students = Seq(
  (1, 100, "Steve"),
  (2, 101, "Peter"),
  (3, 101, "Debby"),
  (4, 102, "Michael")
).toDF("student_id", "dept_id", "student_first_name")

val validDepartments =
  Seq(
    (100),
    (102),
  ).toDF("dept_id")

val validStudents =
  students
    .join(validDepartments, Seq("dept_id"))

validStudents.show(false)
Or, if you want to keep your way of joining, drop the column by referencing it through the DataFrame it comes from:
val students = Seq(
  (1, 100, "Steve"),
  (2, 101, "Peter"),
  (3, 101, "Debby"),
  (4, 102, "Michael")
).toDF("student_id", "dept_id", "student_first_name")

val validDepartments =
  Seq(
    (100),
    (102),
  ).toDF("dept_id")

val validStudents =
  students.as("s")
    .join(validDepartments.as("d"), col("s.dept_id") === col("d.dept_id"))
    .drop(validDepartments("dept_id"))

validStudents.show(false)