Search code examples
clickhouse

SQL Query for Calculating Daily Share of Voice with Top Competitor Analysis


I need help about writing a query to analyze website traffic data from a serp_analytics table. The table structure is as follows:

column type
keyword_id int
website varchar
est_traffic int
date datetime

My objective is to calculate the daily "Share of Voice" for my website ('youtube.com') and identify the top 3 competing websites based on their share of voice. The analysis should cover a specific date range, provided as :start and :end parameters.

Specific Requirements:

  1. Calculate Daily Share of Voice: I need to compute the daily share of voice for 'youtube.com'. The share of voice is calculated as the sum of est_traffic for 'youtube.com' for a given day, divided by the total est_traffic for all websites on that day, multiplied by 100.

  2. Identify Top 3 Competitors: Based on the share of voice on the :end date, determine the top 3 websites (excluding 'youtube.com'). These top competitors should be included in the output for each day in the specified date range.

  3. Output Structure: The desired output is a structured format where each entry corresponds to a website, including 'youtube.com' and the top 3 competitors. Each entry should list the daily share of voice for the date range. If a website has no data on a given day, its share of voice should be shown as 0.

  4. Complete Date Coverage: The output must include all dates in the range between :start and :end. If there's no data for a website on a particular date, the share of voice for that date should be 0.

Here's an example dataset:

| keyword id | website      | est_traffic | date       |
|------------|--------------|-------------|------------|
| 1          | google.com   | 10          | 2023-12-22 |
| 1          | facebook.com | 20          | 2023-12-22 |
| 1          | youtube.com  | 5           | 2023-12-22 |
| 1          | twitter.com  | 40          | 2023-12-22 |
| 1          | linkedin.com | 50          | 2023-12-22 |
| 1          | google.com   | 30          | 2023-12-23 |
| 1          | facebook.com | 20          | 2023-12-23 |
| 1          | youtube.com  | 5           | 2023-12-23 |
| 1          | twitter.com  | 10          | 2023-12-23 |
| 1          | linkedin.com | 15          | 2023-12-23 |

Based on this, the output for 'youtube.com' between dates 2023-12-22 and 2023-12-23 should look something like this:

[
    {
        "domain": "youtube.com",
        "share_of_voice": [
            {"date": "2023-12-22", "value": 4},
            {"date": "2023-12-23", "value": 6.25}
        ]
    },
    // Entries for top 3 competitors
    {
        "domain": "google.com",
        "share_of_voice": [
            {"date": "2023-12-22", "value": 8},
            {"date": "2023-12-23", "value": 37.5}
        ]
    },
    {
        "domain": "twitter.com",
        "share_of_voice": [
            {"date": "2023-12-22", "value": 32},
            {"date": "2023-12-23", "value": 12.5}
        ]
    },
    {
        "domain": "linkedin.com",
        "share_of_voice": [
            {"date": "2023-12-22", "value": 40},
            {"date": "2023-12-23", "value": 18.75}
        ]
    },
]

If it is not possible to get output in format like above then return anything really it doesn't matter as long it has all data.

Thank you in advance for your assistance!


Solution

  • Some comments are in the statement.

    WITH 
      toDate ('2023-12-22') as dt_start
    , toDate ('2023-12-23') as dt_end
    , 'youtube.com'         as site2exclude
    --, 'google.com'            as site2exclude
    SELECT *
    FROM
    (
    SELECT
      *
    -- propagation of the end day enumeration to all other days
    , sum (num_end) over (partition by website)
        as num
    FROM
    (
    SELECT 
      *
    -- enumerating rows for the end day only, the excluded site gets the last number
    , if 
      (
          date = dt_end
        , row_number () over (partition by date order by if (website = site2exclude, 0, share_of_voice) desc)
        , 0
      ) as num_end
    FROM
    (
    SELECT 
      *
    -- share_of_voice computation for each day
    , round (100 * est_traffic / sum (est_traffic) over (partition by date), 2)
        as share_of_voice
    FROM VALUES 
    (
      'website String, est_traffic UInt32, date Date'
    , ('linkedin.com', 50, toDate ('2023-12-22'))
    , ('twitter.com' , 40, toDate ('2023-12-22'))
    , ('facebook.com', 20, toDate ('2023-12-22'))
    , ('google.com'  , 10, toDate ('2023-12-22'))
    , ('youtube.com' ,  5, toDate ('2023-12-22'))
    , ('google.com'  , 30, toDate ('2023-12-23'))
    , ('facebook.com', 20, toDate ('2023-12-23'))
    , ('linkedin.com', 15, toDate ('2023-12-23'))
    , ('twitter.com' , 10, toDate ('2023-12-23'))
    , ('youtube.com' ,  5, toDate ('2023-12-23'))
    )
    WHERE date between dt_start and dt_end
    )
    )
    )
    -- final selection of the excluded site and 3 other competitors
    WHERE num between 1 and 3 or website = site2exclude
    ORDER BY date, num
    

    The result (with a couple of additional fields just for visualization of intermediate computations) is:

    |website     |est_traffic|date      |share_of_voice|num_end|num|
    |------------|-----------|----------|--------------|-------|---|
    |google.com  |10         |2023-12-22|8             |0      |1  |
    |facebook.com|20         |2023-12-22|16            |0      |2  |
    |linkedin.com|50         |2023-12-22|40            |0      |3  |
    |youtube.com |5          |2023-12-22|4             |0      |5  |
    |google.com  |30         |2023-12-23|37.5          |1      |1  |
    |facebook.com|20         |2023-12-23|25            |2      |2  |
    |linkedin.com|15         |2023-12-23|18.75         |3      |3  |
    |youtube.com |5          |2023-12-23|6.25          |5      |5  |