
Creating a new column using info from another df


I'm trying to create a new column based off information from another data table.

df1

Loc Time   Wage
1    192    1
3    192    2
1    193    3
5    193    3
7    193    5
2    194    7

df2

Loc  City
1    NYC
2    Miami
3    LA
4    Chicago
5    Houston
6    SF
7    DC

desired output:

Loc Time   Wage  City
1    192    1    NYC
3    192    2    LA
1    193    3    NYC
5    193    3    Houston
7    193    5    DC
2    194    7    Miami

The actual dataframes differ considerably in row count, but the structure is along those lines. I think this might be achievable through .map, but I haven't found much documentation for that online. join doesn't really seem to fit this situation.


Solution

  • join is exactly what you need. Try running this in the spark-shell:

    import spark.implicits._
    
    val col1 = Seq("loc", "time", "wage")
    val data1 = Seq((1, 192, 1), (3, 192, 2), (1, 193, 3), (5, 193, 3), (7, 193, 5), (2, 194, 7))
    val col2 = Seq("loc", "city")
    val data2 = Seq((1, "NYC"), (2, "Miami"), (3, "LA"), (4, "Chicago"), (5, "Houston"), (6, "SF"), (7, "DC"))
    
    val df1 = spark.sparkContext.parallelize(data1).toDF(col1: _*)
    val df2 = spark.sparkContext.parallelize(data2).toDF(col2: _*)
    
    val outputDf = df1.join(df2, Seq("loc"))  // inner join on the shared column "loc"
    
    outputDf.show()
    

    This will output (row order is not guaranteed):

    +---+----+----+-------+
    |loc|time|wage|   city|
    +---+----+----+-------+
    |  1| 192|   1|    NYC|
    |  1| 193|   3|    NYC|
    |  2| 194|   7|  Miami|
    |  3| 192|   2|     LA|
    |  5| 193|   3|Houston|
    |  7| 193|   5|     DC|
    +---+----+----+-------+
    |  5| 193|   3|Houston|
    |  7| 193|   5|     DC|
    +---+----+----+-------+
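
One caveat: the join above is an inner join, so any row of df1 whose loc has no match in df2 is dropped. If the real data might contain such rows, a left join keeps them, with a null city. The sketch below illustrates this under some assumptions: the sample rows (including loc 9) and the names wages, cities and withCity are made up for the example, and the broadcast hint only makes sense if the lookup table is small.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

// In the spark-shell, getOrCreate() simply returns the session that already exists.
val spark = SparkSession.builder().master("local[*]").appName("join-sketch").getOrCreate()

// Hypothetical sample data: loc 9 has no entry in the lookup table.
val wages  = spark.createDataFrame(Seq((1, 192, 1), (9, 195, 4))).toDF("loc", "time", "wage")
val cities = spark.createDataFrame(Seq((1, "NYC"), (2, "Miami"))).toDF("loc", "city")

// "left" keeps every row of wages; rows without a match get a null city.
// broadcast() is only a hint that cities is small enough to copy to every executor.
val withCity = wages.join(broadcast(cities), Seq("loc"), "left")

withCity.show()
```

Whether to use the inner or the left variant depends on whether rows with an unknown loc should be discarded or kept for later inspection.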