Search code examples
sqlgoogle-bigquerycasedremel

Combining words (tomato,carrot) and display count of locales over time in Big Query [en,en-uk,en-sv,en-au as EN]


I am currently picking up SQL and have come across a problem.

This question pertains to many use cases for me. I have locales in en-uk, en-au, es-latam, es-spain that I want to combine simply as EN or ES for reporting over time.

Below are examples on asking how to count number of tomato page views, and number of carrot page views.

I am hoping once I figure out how to do this, I can apply this to language locales.

The intended output can be seen in this spreadsheet here in column H:L: https://docs.google.com/spreadsheets/d/1CNE__ikiHEQHedH0UiSPmRI1s47e7qEH_aJJVtYSSzU/edit?usp=sharing

Unfamiliar with CASE as I am beginning to start out on this journey, but I need to aggregate/summarize the data already lying in the table even more so that I may build a chart.

Can anyone point out any optimization areas? In addition, a side request: AND url.website like '%tomato%' or '%carrot%' (How do I make this an OR?)

Lastly, can anyone help me figure out how to use a NOT CONTAIN in AND url.website does not contain any of these words case insensitive (?i) potato,Mushroom,Celery

I am coming from a Spreadsheets background and am an advanced user, but seems as if I am having difficulty transferring this knowledge over to SQL.

Many thanks! And please let me know if you have any questions,

After many forum scourings, I understand I will need to insert a Subquery in here in order to get the intended table. I aim to plot the count of these website page visits in a graph over time.

[Most Recent EDIT]

Shows URL with Count number on the right hand side

SELECT
url.website
report.timestamp
count(url.website) as count
FROM
datatable.report
WHERE url.website like '%carrot%' OR url.website like '%tomato%'
Group by url.website

Error code: report.timestamp is not found in Group By - but if I add it, I get microseconds in one column, and the count of appearances of these microseconds.

The main gist is to add the report.timestamp in the select, so that I could plot by aggregated month, but once I do this, the count values are not summed.

[Past Edit 2]

  SELECT
  url.website
  COUNT(url.website) as Count
  (CASE WHEN report.web.url like '%carrot%' then 'carrot website'
  WHEN report.web.url like '%tomato%' then 'tomato website'
  ELSE 'other website'
  END)

  FROM datatable.report
  WHERE (product.tag = 12345)
  AND url.website NOT IN ('Potato','Mushroom','Celery')
  AND url.website like '%tomato%'
  GROUP BY url.website

[Past Edit 3]

 SELECT
 (CASE WHEN url.website like '%carrot%' THEN 'carrot'
 WHEN url.website like '%tomato%' THEN 'tomato'
 ELSE 'other'
 END)
 url.website
 COUNT(carrot) as carrotwebsite
 COUNT(fundamental) As tomatowebsite

*thinking that maybe I needed to case/group them first, then show the 
 count 
 displays.

Please see column H:L in the public spreadsheet: https://docs.google.com/spreadsheets/d/1CNE__ikiHEQHedH0UiSPmRI1s47e7qEH_aJJVtYSSzU/edit?usp=sharing


Solution

  • Below is for BigQuery Standard SQL

    #standardSQL
    SELECT FORMAT_DATE('%b %Y', PARSE_DATE('%m/%d/%Y', dt)) month_year, 
      COUNTIF(url LIKE '%tomato%') tomato_views,
      COUNTIF(url LIKE '%carrot%') carrot_views,
      COUNTIF(NOT url LIKE '%tomato%' AND NOT url LIKE '%carrot%') other_views
    FROM `project.dataset.table`
    GROUP BY month_year  
    

    You can test, play with above using sample/dummy data as in below example

    #standardSQL
    WITH `project.dataset.table` AS (
      SELECT '1/1/2019' dt, 'www.websiteurl.com/tomato/page1' url UNION ALL
      SELECT '1/10/2019', 'www.websiteurl.com/tomato/page2' UNION ALL
      SELECT '1/3/2019', 'www.websiteurl.com/tomato/page3' UNION ALL
      SELECT '2/4/2019', 'www.websiteurl.com/tomato/page4' UNION ALL
      SELECT '2/21/2019', 'www.websiteurl.com/tomato/page5' UNION ALL
      SELECT '2/7/2019', 'www.websiteurl.com/tomato/page6' UNION ALL
      SELECT '3/7/2019', 'www.websiteurl.com/tomato/page7' UNION ALL
      SELECT '3/15/2019', 'www.websiteurl.com/tomato/page8' UNION ALL
      SELECT '3/29/2019', 'www.websiteurl.com/tomato/page9' UNION ALL
      SELECT '3/16/2019', 'www.websiteurl.com/tomato/page10' UNION ALL
      SELECT '1/11/2019', 'www.websiteurl.com/carrot/page1' UNION ALL
      SELECT '1/12/2019', 'www.websiteurl.com/carrot/page2' UNION ALL
      SELECT '4/10/2019', 'www.websiteurl.com/carrot/page3' UNION ALL
      SELECT '4/10/2019', 'www.websiteurl.com/carrot/page4' UNION ALL
      SELECT '4/18/2019', 'www.websiteurl.com/carrot/page5' UNION ALL
      SELECT '1/16/2019', 'www.websiteurl.com/carrot/page6' UNION ALL
      SELECT '1/17/2019', 'www.websiteurl.com/carrot/page7' UNION ALL
      SELECT '1/18/2019', 'www.websiteurl.com/turnip/home' UNION ALL
      SELECT '1/19/2019', 'www.websiteurl.com/turnip/resources' 
    )
    SELECT FORMAT_DATE('%b %Y', PARSE_DATE('%m/%d/%Y', dt)) month_year, 
      COUNTIF(url LIKE '%tomato%') tomato_views,
      COUNTIF(url LIKE '%carrot%') carrot_views,
      COUNTIF(NOT url LIKE '%tomato%' AND NOT url LIKE '%carrot%') other_views
    FROM `project.dataset.table`
    GROUP BY month_year   
    

    with result

    Row month_year  tomato_views    carrot_views    other_views  
    1   Jan 2019    3               4               2    
    2   Feb 2019    3               0               0    
    3   Mar 2019    4               0               0    
    4   Apr 2019    0               3               0