sql azure apache-spark pyspark azure-blob-storage

Spark - SQL query does not retrieve the same number of rows when using SELECT * or SELECT col1

Context

I am running an SQL query over PySpark, on top of an Azure Blob storage datalake. The lake is partitioned with several nested keys.

I ran two queries whose logic is the same to me regarding the filters, but actually query 1 returns 20 million rows, while query 2 returns 28 million rows.

Here is the structure of the blob storage container :

.
|-- company=first
|   |-- year=2021
|   |-- year=2022
|   |-- year=2023
|   `-- year=2024
|-- company=other
|   |-- year=2021
|   |-- year=2022
|   |-- year=2023
|   `-- year=2024
`-- company=second
    |-- year=2021
    |-- year=2022
    |-- year=2023
    `-- year=2024

Query 1

-- File : query_1.sql
SELECT
    col1,
    col2,
    col3
FROM parquet.`abfs://[email protected]/company=first/year=2023
WHERE col1 IN ('A','B','C')
UNION
SELECT
    col1,
    col2,
    col3
FROM parquet.`abfs://[email protected]/company=second/year=2023
WHERE col1 IN ('A','B','C')
UNION
SELECT
    col1,
    col2,
    col3
FROM parquet.`abfs://[email protected]/company=other/year=2023
WHERE col1 IN ('A','B','C')

Query 2

-- File : query_2.sql
SELECT *
FROM parquet.`abfs://[email protected]/company=first/year=2023
WHERE col1 IN ('A','B','C')
UNION
SELECT *
FROM parquet.`abfs://[email protected]/company=second/year=2023
WHERE col1 IN ('A','B','C')
UNION
SELECT *
FROM parquet.`abfs://[email protected]/company=other/year=2023
WHERE col1 IN ('A','B','C')

PySpark execution code

Here is the code used to run the Spark query from the SQL code

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

query1 = open("query_1.sql", mode="r").read()
df1 = spark.sql(query1)
df1.count()   # Gives 20 million rows

query2 = open("query_2.sql", mode="r").read()
df2 = spark.sql(query2)
df2.count()   # Gives 28 million rows

Question

Any idea what could be the cause of this gap ? Anything related to using PySpark / PySpark over SQL / the SQL query itself / something in the data itself ?

Solution

I agree with @NickW. Here duplicates seem to be the cause.

I have tried the following query where I have used select * with union and you can see I got 4 rows as output.

%sql
select  *  from One WHERE employee_name IN ('James','Maria','Jeff')
union
select  *  from Two WHERE employee_name IN ('James','Maria','Jeff')
union
select  *  from Three WHERE employee_name IN ('James','Maria','Jeff');

enter image description here

When I have tried your second case (select column with union), I got less rows.

%sql
select employee_name from One WHERE employee_name IN ('James','Maria','Jeff')
union
select employee_name from Two WHERE employee_name IN ('James','Maria','Jeff')
union
select employee_name from Three WHERE employee_name IN ('James','Maria','Jeff');

enter image description here

This is because, union removes the duplicates. In Second case I have selected only employee_name and it has duplicates(row count will be 3 after removing duplicates) whereas in the first case there won't be any duplicates because in that it is selecting all columns. This is same in your case as well. Check with UNION ALL as suggested in comments.