I am running an SQL query over PySpark, on top of an Azure Blob storage datalake. The lake is partitioned with several nested keys.
I ran two queries whose logic is the same to me regarding the filters, but actually query 1 returns 20 million rows, while query 2 returns 28 million rows.
Here is the structure of the blob storage container :
.
|-- company=first
| |-- year=2021
| |-- year=2022
| |-- year=2023
| `-- year=2024
|-- company=other
| |-- year=2021
| |-- year=2022
| |-- year=2023
| `-- year=2024
`-- company=second
|-- year=2021
|-- year=2022
|-- year=2023
`-- year=2024
-- File : query_1.sql
SELECT
col1,
col2,
col3
FROM parquet.`abfs://[email protected]/company=first/year=2023
WHERE col1 IN ('A','B','C')
UNION
SELECT
col1,
col2,
col3
FROM parquet.`abfs://[email protected]/company=second/year=2023
WHERE col1 IN ('A','B','C')
UNION
SELECT
col1,
col2,
col3
FROM parquet.`abfs://[email protected]/company=other/year=2023
WHERE col1 IN ('A','B','C')
-- File : query_2.sql
SELECT *
FROM parquet.`abfs://[email protected]/company=first/year=2023
WHERE col1 IN ('A','B','C')
UNION
SELECT *
FROM parquet.`abfs://[email protected]/company=second/year=2023
WHERE col1 IN ('A','B','C')
UNION
SELECT *
FROM parquet.`abfs://[email protected]/company=other/year=2023
WHERE col1 IN ('A','B','C')
Here is the code used to run the Spark query from the SQL code
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
query1 = open("query_1.sql", mode="r").read()
df1 = spark.sql(query1)
df1.count() # Gives 20 million rows
query2 = open("query_2.sql", mode="r").read()
df2 = spark.sql(query2)
df2.count() # Gives 28 million rows
Any idea what could be the cause of this gap ? Anything related to using PySpark / PySpark over SQL / the SQL query itself / something in the data itself ?
I agree with @NickW. Here duplicates seem to be the cause.
I have tried the following query where I have used select *
with union
and you can see I got 4 rows as output.
%sql
select * from One WHERE employee_name IN ('James','Maria','Jeff')
union
select * from Two WHERE employee_name IN ('James','Maria','Jeff')
union
select * from Three WHERE employee_name IN ('James','Maria','Jeff');
When I have tried your second case (select column
with union
), I got less rows.
%sql
select employee_name from One WHERE employee_name IN ('James','Maria','Jeff')
union
select employee_name from Two WHERE employee_name IN ('James','Maria','Jeff')
union
select employee_name from Three WHERE employee_name IN ('James','Maria','Jeff');
This is because, union
removes the duplicates. In Second case I have selected only employee_name
and it has duplicates(row count will be 3 after removing duplicates) whereas in the first case there won't be any duplicates because in that it is selecting all columns. This is same in your case as well. Check with UNION ALL
as suggested in comments.