Tags: scala, apache-spark, apache-spark-sql

How to get columns to show in a select statement in Spark Scala


I am using the code below to select columns from two tables in Spark with Scala 2.11.11. It runs, but the result set contains only the package ID and the number of packages. I need it to also include the first name and last name. What am I missing in my code?

import org.apache.spark.sql.functions._
import spark.implicits._
flData_csv
  .toDF("packageId", "flId", "date", "to", "from")

customers_csv.toDF("packageId", "firstName", "lastName")

flData_csv
  .join(customers_csv, Seq("packageId"))
  .select("packageId", "count", "firstName", "lastName")
  .withColumnRenamed("packageId", "Package ID").groupBy("Package ID").count()
  .withColumnRenamed("count", "Number of Packages")
  .filter(col("count") >= 20)
  .withColumnRenamed("firstName", "First Name")
  .withColumnRenamed("lastName", "Last Name")
  .show(100)

Solution

  • After reading your code, I noticed that there's a .groupBy call after the packageId renaming. After a .groupBy, you're left with only the grouping key(s) (Package ID in this case) and whatever columns the aggregation produces; every other column is dropped.

    Adding firstName and lastName as grouping keys should solve your problem. Two other fixes are needed as well: "count" can't appear in the select before the aggregation (it doesn't exist yet), and the filter has to run before "count" is renamed. Here's a sample:

    flData_csv
      .join(customers_csv, Seq("packageId"))
      .select("packageId", "firstName", "lastName")
      .groupBy("packageId", "firstName", "lastName")
      .count()
      .filter(col("count") >= 20)
      .withColumnRenamed("packageId", "Package ID")
      .withColumnRenamed("count", "Number of Packages")
      .withColumnRenamed("firstName", "First Name")
      .withColumnRenamed("lastName", "Last Name")
      .show(100)
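
    Alternatively, you can sidestep the rename/filter ordering issue entirely by naming the aggregate up front with .agg and an alias. A minimal sketch, assuming flData_csv and customers_csv already carry the column names from your toDF calls (note that toDF returns a new DataFrame, so in the original snippet its result would need to be assigned back):

    // already covered by the wildcard import above, shown here for self-containment
    import org.apache.spark.sql.functions.{col, count}

    flData_csv
      .join(customers_csv, Seq("packageId"))
      .groupBy("packageId", "firstName", "lastName")
      .agg(count("*").alias("Number of Packages"))   // name the aggregate once, no rename needed
      .filter(col("Number of Packages") >= 20)       // filter on the final column name
      .withColumnRenamed("packageId", "Package ID")
      .withColumnRenamed("firstName", "First Name")
      .withColumnRenamed("lastName", "Last Name")
      .show(100)

    With .agg you control the aggregate column's name at creation time, so the filter condition and the displayed header can't drift apart the way they can with .count() followed by .withColumnRenamed.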