I have following SQL query, which I translated to HQL:
SELECT f.date,
f.name,
SUM(f.seats)
FROM Foo f
WHERE EXISTS ( SELECT 1
FROM Foo fh
WHERE f.start + f.end IN ( SELECT fl.start + fl.end
FROM Foo fl
WHERE fl.date BETWEEN dateadd(yy,-1,fh.date)
AND fh.date
AND fl.name = '<name>')
AND f.date = fh.date
AND fh.date >= '2016-01-01'
AND fh.name = '<name>' )
AND f.date >= '2016-01-01'
GROUP BY f.date,
f.name
ORDER BY f.date ASC,
SUM(f.seats) DESC
In my application this query causes the error in the title:
Caused by: com.microsoft.sqlserver.jdbc.SQLServerException: Could not allocate a new page for database 'TEMPDB' because of insufficient disk space in filegroup 'DEFAULT'. Create the necessary space by dropping objects in the filegroup, adding additional files to the filegroup, or setting autogrowth on for existing files in the filegroup.
at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:216)
at com.microsoft.sqlserver.jdbc.SQLServerResultSet$FetchBuffer.nextRow(SQLServerResultSet.java:4853)
at com.microsoft.sqlserver.jdbc.SQLServerResultSet.fetchBufferNext(SQLServerResultSet.java:1781)
at com.microsoft.sqlserver.jdbc.SQLServerResultSet.next(SQLServerResultSet.java:1034)
at org.apache.commons.dbcp2.DelegatingResultSet.next(DelegatingResultSet.java:191)
at org.apache.commons.dbcp2.DelegatingResultSet.next(DelegatingResultSet.java:191)
at org.hibernate.loader.Loader.processResultSet(Loader.java:986)
at org.hibernate.loader.Loader.doQuery(Loader.java:948)
at org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Loader.java:340)
at org.hibernate.loader.Loader.doList(Loader.java:2689)
This is obviously cause by the sheer inefficiency of the query as well as the number of times its executed as well as the amount of rows it works with.
Lets explain what the query does. Following example:
I have data on uber drivers. Each row is one drive of the driver with date (month), driver name, seats the driver had available, start location and end location.
E. g.:
Date Name Seats Start End
-------------------------------------------
7/1/2019 John 45 A B
The data is summed per month. So John
had 9 drives between A
and B
and each time he had 5 seats available. Of course, there are also other people who drive the same route and therefore pose as competition to John
.
Date Name Seats Start End
-------------------------------------------
7/1/2019 John 45 A B
7/1/2019 Doe 25 A A
7/1/2019 Alice 35 A C
7/1/2019 John 30 A A
7/1/2019 Doe 25 A C
7/1/2019 Alice 10 A B
7/1/2019 Doe 5 A B
7/1/2019 Alice 15 A A
So for 7/1/2019
John
s "network" (all routes) had this competition:
Date Name Seats Route
---------------------------------
7/1/2019 John 30 A-A
7/1/2019 Doe 25 A-A
7/1/2019 Alice 15 A-A
7/1/2019 John 45 A-B
7/1/2019 Doe 5 A-B
7/1/2019 Alice 10 A-B
As you can see, in this result, the route A-C
is not listed, because John
didn't drive it at all. If we expand the example data by a new month 8/1/2019
:
Date Name Seats Start End
-------------------------------------------
8/1/2019 John 65 A C
8/1/2019 Doe 25 A A
8/1/2019 Alice 35 A A
8/1/2019 Doe 25 A B
8/1/2019 Alice 10 A B
8/1/2019 Doe 5 A C
8/1/2019 Alice 15 A C
we can see that John
only drove A-C
this month. As the network
should be build over a timespan of 1 year into the past (8/1/2018 to 8/1/2019), John
s network is now all three routes (A-A
, A-B
, A-C
), but only for calculating the competitors as of 8/1/2019
. For 7/1/2019
, John
s network stays A-A
, A-B
. So the result for 8/1/2019
is this:
Date Name Seats Route
---------------------------------
8/1/2019 John 0 A-A
8/1/2019 Doe 25 A-A
8/1/2019 Alice 35 A-A
8/1/2019 John 0 A-B
8/1/2019 Doe 25 A-B
8/1/2019 Alice 10 A-B
8/1/2019 John 65 A-C
8/1/2019 Doe 5 A-C
8/1/2019 Alice 10 A-C
John
only drove A-C
, which is the reason, why he is counted with 0 seats for the other routes.
As the results are summing the seats and ignore the route, the actual output of the query is as following:
7/1/2019 John 75 <-- 30+45
7/1/2019 Doe 30 <-- 25+5
7/1/2019 Alice 25 <-- 10+15
8/1/2019 John 65 <-- 65+0+0
8/1/2019 Doe 55 <-- 25+25+5
8/1/2019 Alice 55 <-- 35+10+10
In this result we have as routes for 7/1/2019
only A-A
and A-B
for John
s competitors, as there is no data before that date. For 8/1/2019
John
s network is A-A
, A-B
and A-C
, even though he only drove A-C
in 8/1/2019
(A-A
and A-B
were in 7/1/2019
).
I hope the data I provided is understandable. If you need more clarification, just ask and I will try to explain even more.
How do I need to change my query to greatly increase the performance?
I have not used JOIN
s so far, because I had to join on a subquery, and that is not allowed in HQL.
If you need more info/clarification, feel free to ask!
EDIT:
I know that I could have also posted on codereview.stackexchange.com, but I chose against it, because the query itself works, if only performed for 1 name and only fails for more. My understanding of codereview.stackexchange.com
is, that there should only be performance improvement questions
I came up with this query shortly after posting the question:
SELECT f.date,
f.name,
SUM(f.seats)
FROM Foo f
WHERE f.start + f.end IN ( SELECT fh.start + fh.end
FROM Foo fh
WHERE fh.date BETWEEN DATEADD(yy, -1, f.date)
AND f.date
AND fh.name = '<name>')
AND f.date >= '2016-01-01'
GROUP BY f.date,
f.name
ORDER BY f.date ASC,
SUM(f.seats) DESC
As you can see, I pretty much just removed the WHERE EXISTS
clause. I am not quite certain if that was the right thing to do, or what errors this could cause, but it at least solved the error at hand (insufficient disk space exception).
If you have another idea or remarks on my query, feel free to share them!