
Vertica SQL for running count distinct and running conditional count


I'm trying to build a department-level score table from a more granular product-URL-level score table.

  1. Dates are not consecutive

  2. Not all URLs get score updates on the same day (they are independent of each other)

  3. dist_url should be a running (cumulative) distinct count

  4. Both dist urls and urls score >=30 are distinct counts

What I have now is:

   Date  url   Store   Dept   Page   Score   
   10/1   a      US      A      X      10   
   10/1   b      US      A      X      30  
   10/1   c      US      A      X      60
   10/4   a      US      A      X      20  
   10/4   d      US      A      X      60
   10/6   b      US      A      X      22 
   10/9   a      US      A      X      40
   10/9   e      US      A      X      10

What I want is:


   Date  Store   Dept   Page   dist urls   urls score >=30  
   10/1   US      A      X          3          2 
   10/4   US      A      X          4          3
   10/6   US      A      X          4          2
   10/9   US      A      X          5          2

I think dist_url can be computed with a window function, but I'm not sure how to write the query.

My current query is below, but it's wrong because the distinct count is not cumulative:

    SELECT
        bm.AnalysisDate,
        su.SoID         AS Store,
        su.DptCaID      AS DTID,
        su.PageTypeID   AS PTID,
        COUNT(DISTINCT bm.SeoURLID) AS NumURLsWithDupScore,
        SUM(CASE WHEN bm.DuplicationScore > 30 THEN 1 ELSE 0 END) AS Over30Count
    FROM csn_seo.tblBotifyMetrics bm 
    INNER JOIN csn_seo.tblSEOURLs su 
        ON bm.SeoURLID = su.ID
    WHERE su.DptCaID IS NOT NULL 
        AND su.DptCaID <> 0    
        AND su.PageTypeID IS NOT NULL
        AND su.PageTypeID <> -1
        AND bm.iscompliant = 1
    GROUP BY bm.AnalysisDate, su.SoID, su.DptCaID, su.PageTypeID;

Please let me know if anyone has any idea.


Solution

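  • Based on your question, you seem to want two levels of logic: first find the earliest date on which each url appears (and the earliest date on which it scores 30 or more), then take a running sum of those first appearances by date: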

    select date, store, dept, page,
           sum(sum(start)) over (partition by store, dept, page order by date) as distinct_urls,
           sum(sum(start_30)) over (partition by store, dept, page order by date) as distinct_urls_30
    from (-- first date on which each url appears
          (select store, dept, page, url, min(date) as date, 1 as start, 0 as start_30
           from t
           group by store, dept, page, url 
          ) union all
          -- first date on which each url scores 30 or more
          (select store, dept, page, url, min(date) as date, 0 as start, 1 as start_30
           from t
           where score >= 30
           group by store, dept, page, url 
          )
         ) t
    group by date, store, dept, page;
    

    I don't understand how your query is related to your question.
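
To see what the first-appearance approach returns on the sample rows from the question, here is a minimal sketch that runs it in an in-memory SQLite database through Python's `sqlite3` module. SQLite only stands in for Vertica here, and the sketch assumes a SQLite build with window-function support (3.25 or later). The table name `t` follows the answer; the CTE names `first_seen` and `per_day`, the helper `running_counts`, and the year in the ISO dates are assumptions made for illustration. The nested `sum(sum(...))` is split into a separate per-day aggregation step, which should be equivalent.

```python
import sqlite3

# Sample rows from the question: (date, url, store, dept, page, score).
# ISO date strings sort correctly as text; the year 2020 is an arbitrary assumption.
ROWS = [
    ("2020-10-01", "a", "US", "A", "X", 10),
    ("2020-10-01", "b", "US", "A", "X", 30),
    ("2020-10-01", "c", "US", "A", "X", 60),
    ("2020-10-04", "a", "US", "A", "X", 20),
    ("2020-10-04", "d", "US", "A", "X", 60),
    ("2020-10-06", "b", "US", "A", "X", 22),
    ("2020-10-09", "a", "US", "A", "X", 40),
    ("2020-10-09", "e", "US", "A", "X", 10),
]

QUERY = """
WITH first_seen AS (
    -- first date on which each url appears at all
    SELECT store, dept, page, url, MIN(date) AS date, 1 AS start, 0 AS start_30
    FROM t
    GROUP BY store, dept, page, url
    UNION ALL
    -- first date on which each url scores 30 or more
    SELECT store, dept, page, url, MIN(date) AS date, 0 AS start, 1 AS start_30
    FROM t
    WHERE score >= 30
    GROUP BY store, dept, page, url
),
per_day AS (
    -- number of urls seen for the first time on each date
    SELECT date, store, dept, page,
           SUM(start) AS new_urls,
           SUM(start_30) AS new_urls_30
    FROM first_seen
    GROUP BY date, store, dept, page
)
SELECT date, store, dept, page,
       SUM(new_urls) OVER (PARTITION BY store, dept, page ORDER BY date) AS distinct_urls,
       SUM(new_urls_30) OVER (PARTITION BY store, dept, page ORDER BY date) AS distinct_urls_30
FROM per_day
ORDER BY date
"""


def running_counts(rows):
    """Load rows into an in-memory table t and return the running distinct counts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE t (date TEXT, url TEXT, store TEXT, dept TEXT, page TEXT, score INTEGER)"
    )
    conn.executemany("INSERT INTO t VALUES (?, ?, ?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in running_counts(ROWS):
        print(row)
```

On the sample data this gives 3 and 2 for 10/1, 4 and 3 for 10/4, and 5 and 4 for 10/9. Two things differ from the expected table in the question. There is no row for 10/6, because no url first appears or first reaches 30 on that date. And the second column counts urls that have reached 30 at least once, so it never decreases; the drop on 10/6 in the expected table appears to need each url's latest score instead.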