Tags: dataframe, pyspark, apache-spark-sql

PySpark DataFrame Column Arithmetic Operations


I have a DataFrame in PySpark with two columns, Flight Origin and Flight Destination.

Origin Destination
LAX JFK
LHR LAX
ATL DFW
JFK LHR
ATL LHR

What I want to do, using PySpark DataFrames, is count how many times each distinct value appears across both columns combined; e.g. LHR appears 3 times across Origin and Destination together.

Expected Output

Airport Count
LAX 2
ATL 2
LHR 3
JFK 2
DFW 1

Is there any way I can get a single count that covers both columns?

I have already computed per-column counts with PySpark DataFrames, once on Origin and once on Destination, and get lists like:

Distinct count in Origin:
LAX-1
ATL-2
LHR-1
JFK-1

Distinct count in Destination:
LAX-1
DFW-1
LHR-2
JFK-1

Now how do I add these two sets of counts together?

Any help will be appreciated. Thanks


Solution

Stack the two columns into one with `union` (the current name for the deprecated `unionAll`; for DataFrames neither removes duplicates, which is exactly what you want here), then group and count:

>>> from pyspark.sql import functions as F
>>> df = spark.createDataFrame([('LAX', 'JFK'),
...                             ('ATL', 'DFW'),
...                             ('LHR', 'LAX'),
...                             ('JFK', 'LHR'),
...                             ('ATL', 'LHR')], schema=['origin', 'destination'])
>>> o_count = df.select(F.col('origin').alias('place'))
>>> d_count = df.select(F.col('destination').alias('place'))
>>> o_count.union(d_count).groupBy('place').count().show()
+-----+-----+
|place|count|
+-----+-----+
|  LHR|    3|
|  DFW|    1|
|  ATL|    2|
|  LAX|    2|
|  JFK|    2|
+-----+-----+