python, pyspark, databricks

Calculate the count of distinct values appearing in multiple tables


I have three PySpark DataFrames in Databricks: raw_old, raw_new, and master_df. These are placeholders to work out the logic on a smaller scale (the actual tables contain billions of rows). All three have a column called label. I want to calculate the number of labels that appear in:

  • raw_old and raw_new (the answer is 3: A789, B456, D123)
  • raw_new and master_df (the answer is 2: C456, D123)
  • raw_old and master_df (the answer is 4: A654, B987, C987, D123)
  • raw_old, raw_new, and master_df (the answer is 1: D123)

The three tables are below. How do I calculate the above bullet points?

raw_old

+---+-----+
| id|label|
+---+-----+
|  1| A987|
|  2| A654|
|  3| A789|
|  4| B321|
|  5| B456|
|  6| B987|
|  7| C321|
|  8| C654|
|  9| C987|
| 10| D123|
+---+-----+

raw_new

+---+-----+
| id|label|
+---+-----+
|  1| A123|
|  2| A456|
|  3| A789|
|  4| B123|
|  5| B456|
|  6| B789|
|  7| C123|
|  8| C456|
|  9| C789|
| 10| D123|
+---+-----+

master_df

+---+-----+
| id|label|
+---+-----+
|  1| A999|
|  2| A654|
|  3| A000|
|  4| B111|
|  5| B000|
|  6| B987|
|  7| C999|
|  8| C456|
|  9| C987|
| 10| D123|
+---+-----+

Solution

  • Use an inner join to get the labels that the two DataFrames have in common

    joined_data = raw_old.join(
        raw_new,
        on=raw_old["label"] == raw_new["label"],
        how="inner"
    )
    

    and then collect only the count back to Python, keeping all the heavy work in Spark

    print(joined_data.count())
    

    To intersect all three DataFrames, join the first two and then join the resulting DataFrame with the third.