Search code examples
mysqlsqlbigdatatableau-apidistinct

Optomizing a simple query with 70mil rows to fit into Tableau


Noobie to SQL. I have a simple query here that is 70 million rows, and my work laptop will not handle the capacity when I import it into Tableau. Usually 20 million rows and less seem to work fine. Here's my problem.

Table name: Table1

Fields: UniqueID, State, Date, claim_type

Query:

SELECT uniqueID, states, claim_type, date

FROM table1

WHERE date >= '11-09-2021'

This gives me what I want, BUT, I can limit the query significantly if I count the number of uniqueIDs that have been used in 3 or more different states. I use this query to do that.

SELECT unique_id, count(distinct states), claim_type, date

FROM table1

WHERE date >= '11-09-2021'

GROUP BY Unique_id, claim_type, date

HAVING COUNT(DISTINCT states) > 3

The only issue is, when I put this query into Tableau it only displays the FIRST state a unique_id showed up in, and the first date it showed up. A unique_id shows up in multiple states over multiple dates, so when I use this count aggregation it's only giving me the first result and not the whole picture.

Any ideas here? I am totally lost and spent a whole business day trying to fix this

Expected output would be something like

uniqueID | state | claim type | Date

123 Ohio C 01-01-2021

123 Nebraska I 02-08-2021

123 Georgia D 03-08-2021


Solution

  • If your table is only of those four columns, and your queries are based on date ranges, your index must exist to help optimize that. If 70 mil records exist, how far back does that go... Years? If your data since 2021-09-11 is only say... 30k records, that should be all you are blowing through for your results.

    I would ensure you have the index based on (and in this order) (date, uniqueId, claim_type, states). Also, you mentioned you wanted a count of 3 OR MORE, your query > 3 will results in 4 or more unless you change to count(*) >= 3.

    Then, to get the entries you care about, you need

    SELECT date, uniqueID, claim_type
       FROM table1
       WHERE date >= '2021-09-11'
       group by date, uniqueID, claim_type
       having count( distinct states ) >= 3
    

    This would give just the 3-part qualifier for date/id/claim that HAD them. Then you would use THIS result set to get the other entries via

    select distinct
          date, uniqueID, claim_type, states
       from
          ( SELECT date, uniqueID, claim_type
               FROM table1
               WHERE date >= '2021-09-11'
               group by date, uniqueID, claim_type
               having count( distinct states ) >= 3 ) PQ
             JOIN Table1 t1
                on PQ.date = t1.date
               and PQ.UniqueID = t1.UniqueID
               and PQ.Claim_Type = t1.Claim_Type
    

    The "PQ" (preQuery) gets the qualified records. Then it joins back to the original table and grabs all records that qualified from the unique date/id/claim_type and returns all the states.