Tags: dataframe, apache-spark, pyspark, apache-spark-sql

Join two data frames, select all columns from one and some columns from the other


Let's say I have a Spark DataFrame df1 with several columns (among them a column id), and a DataFrame df2 with two columns, id and other.

Is there a way to replicate the following command:

sqlContext.sql("SELECT df1.*, df2.other FROM df1 JOIN df2 ON df1.id = df2.id")

by using only pyspark functions such as join(), select() and the like?

I have to implement this join in a function and I don't want to be forced to have sqlContext as a function parameter.
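
For reference, a minimal self-contained setup along these lines reproduces the situation (the extra columns and sample values are invented purely for illustration, and it assumes a modern SparkSession rather than the older sqlContext entry point):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# df1: several columns, among them 'id'; df2: only 'id' and 'other'
df1 = spark.createDataFrame([(1, "alice", 10), (2, "bob", 20)], ["id", "name", "value"])
df2 = spark.createDataFrame([(1, "x"), (2, "y")], ["id", "other"])

# Registering temp views makes the SQL version above runnable as well
df1.createOrReplaceTempView("df1")
df2.createOrReplaceTempView("df2")
spark.sql("SELECT df1.*, df2.other FROM df1 JOIN df2 ON df1.id = df2.id").show()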


Solution

  • Not sure if it's the most efficient way, but this worked for me:

    from pyspark.sql.functions import col

    # Alias both frames so their columns can be referenced unambiguously,
    # then keep every column of df1 plus selected columns from df2
    df1.alias('a').join(df2.alias('b'), col('b.id') == col('a.id')) \
        .select([col('a.' + c) for c in df1.columns] + [col('b.other1'), col('b.other2')])
    

    The trick is in:

    [col('a.' + c) for c in df1.columns]: all columns of df1 (alias a)

    [col('b.other1'), col('b.other2')]: some columns of df2 (alias b)
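
    As a sketch (not part of the original answer), here is an alternative that avoids string aliases entirely, assuming the join key is called id in both frames: select everything from df1 with df1['*'] and pick the extra column from df2 directly.

    # Keep all of df1's columns plus df2.other; this is unambiguous because
    # the duplicated df2.id column is never selected
    result = df1.join(df2, df1.id == df2.id).select(df1['*'], df2.other)
    result.show()

    Passing the join condition as a Column expression keeps each frame's lineage intact, which is what lets df1['*'] and df2.other resolve without aliases.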