Search code examples
dataframeapache-sparkjoinpyspark

Join two PySpark DataFrames and get some of the columns from one DataFrame when column names are similar


I want to join 2 PySpark DataFrames. But, I want all columns from one DataFrame, and some of columns from the 2nd DataFrame. The point is that there is a column with similar name in both DataFrames.

Sample DataFrames:

# Prepare Data
data_1 = [
    (1, "Italy", "Europe"),
    (2, "Italy", "Europe"),
    (3, "Germany", None),
    (4, "Iran", "Asia"),
    (5, "China", "Asia"),
    (6, "China", None),
    (7, "Japan", "Asia"),
    (8, "France", None),
]

# Create DataFrame
columns = ["Code", "Country", "Continent"]
df_1 = spark.createDataFrame(data=data_1, schema=columns)
df_1.show(truncate=False)

enter image description here

# Prepare Data
data_2 = [
    (1, "Italy", "EUR", 11),
    (2, "Germany", "EUR", 12),
    (3, "China", "CNY", 13),
    (4, "Japan", "JPY", 14),
    (5, "France", "EUR", 15),
    (6, "Taiwan", "TWD", 16),
    (7, "USA", "USD", 17),
    (8, "India", "INR", 18),
]

# Create DataFrame
columns = ["Code", "Country", "Currency", "Sales"]
df_2 = spark.createDataFrame(data=data_2, schema=columns)
df_2.show(truncate=False)

enter image description here

I want all columns of the 1st DataFrame and only column "Currency" from the 2nd DataFrame. When I use left join:

output = df_1.join(df_2, ["Country"], "left")
output.show()

Now, there are two columns with name "Code" after Join operation.

enter image description here

Using drop columns:

output = df_1.join(df_2, ["Country"], "left").drop('Code', 'Sales')
output.show()

enter image description here

Both columns named "Code" are dropped. But, I want to keep column "Code" from the 1st DataFrame.

Any idea how to solve this issue?

Another question is how to make column "Code" as the left-most column in the resulting DataFrame after Join operation.


Solution

  • If you don't need columns from df_2, you can drop them before the join like this:

    output = df_1.join(
        df_2.select('Country', 'Currency'),
        ['Country'], 'left'
    )
    

    Note that you can also disambiguate two columns with the same name by specifying the dataframe they come from. e.g. df_1['Code']. So in your case, after the join, instead of using drop, you could use that to keep only the columns from df_1 and the Currency column:

    output = df_1\
        .join(df_2, ['Country'], 'left')\
        .select([df_1[c] for c in df_1.columns] + ['Currency'])