I have a DataFrame in PySpark that has two columns, Flight Origin and Flight Destination.
| Origin | Destination |
|---|---|
| LAX | JFK |
| LHR | LAX |
| ATL | DFW |
| JFK | LHR |
| ATL | LHR |
What I want to do using PySpark DataFrames is count how many times each distinct value appears across both columns combined, e.g. LHR appears 3 times in total.
Expected Output
| Airport | Count |
|---|---|
| LAX | 2 |
| ATL | 2 |
| LHR | 3 |
| JFK | 2 |
| DFW | 1 |
Is there any way I can get a single combined count across both columns?
I have already computed distinct counts on the single columns with PySpark DataFrames, i.e. once on Origin and once on Destination, and I get lists like these:
Distinct count in Origin:
LAX-1
ATL-2
LHR-1
JFK-1
and
Distinct count in Destination:
LAX-1
DFW-1
LHR-2
JFK-1
Now how do I add these two sets of counts together?
Any help will be appreciated. Thanks
Alias both columns to a common name, stack them with `union` (which, unlike SQL's `UNION`, keeps duplicate rows — exactly what you want here; `unionAll` is just a deprecated alias for it), then do a single `groupBy` count:

>>> from pyspark.sql import functions as F
>>> df = spark.createDataFrame([('LAX', 'JFK'),
...                             ('ATL', 'DFW'),
...                             ('LHR', 'LAX'),
...                             ('JFK', 'LHR'),
...                             ('ATL', 'LHR')], schema=['origin', 'destination'])
>>> o_count = df.select(F.col('origin').alias('place'))
>>> d_count = df.select(F.col('destination').alias('place'))
>>> o_count.union(d_count).groupBy('place').count().show()
+-----+-----+
|place|count|
+-----+-----+
| LHR| 3|
| DFW| 1|
| ATL| 2|
| LAX| 2|
| JFK| 2|
+-----+-----+