I want to join 2 PySpark DataFrames. But, I want all columns from one DataFrame, and some of columns from the 2nd DataFrame. The point is that there is a column with similar name in both DataFrames.
Sample DataFrames:
# Prepare Data
data_1 = [
(1, "Italy", "Europe"),
(2, "Italy", "Europe"),
(3, "Germany", None),
(4, "Iran", "Asia"),
(5, "China", "Asia"),
(6, "China", None),
(7, "Japan", "Asia"),
(8, "France", None),
]
# Create DataFrame
columns = ["Code", "Country", "Continent"]
df_1 = spark.createDataFrame(data=data_1, schema=columns)
df_1.show(truncate=False)
# Prepare Data
data_2 = [
(1, "Italy", "EUR", 11),
(2, "Germany", "EUR", 12),
(3, "China", "CNY", 13),
(4, "Japan", "JPY", 14),
(5, "France", "EUR", 15),
(6, "Taiwan", "TWD", 16),
(7, "USA", "USD", 17),
(8, "India", "INR", 18),
]
# Create DataFrame
columns = ["Code", "Country", "Currency", "Sales"]
df_2 = spark.createDataFrame(data=data_2, schema=columns)
df_2.show(truncate=False)
I want all columns of the 1st DataFrame and only column "Currency" from the 2nd DataFrame. When I use left join:
output = df_1.join(df_2, ["Country"], "left")
output.show()
Now, there are two columns with name "Code" after Join operation.
Using drop columns:
output = df_1.join(df_2, ["Country"], "left").drop('Code', 'Sales')
output.show()
Both columns named "Code" are dropped. But, I want to keep column "Code" from the 1st DataFrame.
Any idea how to solve this issue?
Another question is how to make column "Code" as the left-most column in the resulting DataFrame after Join operation.
If you don't need columns from df_2
, you can drop them before the join like this:
output = df_1.join(
df_2.select('Country', 'Currency'),
['Country'], 'left'
)
Note that you can also disambiguate two columns with the same name by specifying the dataframe they come from. e.g. df_1['Code']
. So in your case, after the join, instead of using drop
, you could use that to keep only the columns from df_1
and the Currency
column:
output = df_1\
.join(df_2, ['Country'], 'left')\
.select([df_1[c] for c in df_1.columns] + ['Currency'])