I need help about writing a query to analyze website traffic data from a serp_analytics
table. The table structure is as follows:
column | type |
---|---|
keyword_id | int |
website | varchar |
est_traffic | int |
date | datetime |
My objective is to calculate the daily "Share of Voice" for my website ('youtube.com') and identify the top 3 competing websites based on their share of voice. The analysis should cover a specific date range, provided as :start
and :end
parameters.
Specific Requirements:
Calculate Daily Share of Voice: I need to compute the daily share of voice for 'youtube.com'. The share of voice is calculated as the sum of est_traffic
for 'youtube.com' for a given day, divided by the total est_traffic
for all websites on that day, multiplied by 100.
Identify Top 3 Competitors: Based on the share of voice on the :end
date, determine the top 3 websites (excluding 'youtube.com'). These top competitors should be included in the output for each day in the specified date range.
Output Structure: The desired output is a structured format where each entry corresponds to a website, including 'youtube.com' and the top 3 competitors. Each entry should list the daily share of voice for the date range. If a website has no data on a given day, its share of voice should be shown as 0.
Complete Date Coverage: The output must include all dates in the range between :start
and :end
. If there's no data for a website on a particular date, the share of voice for that date should be 0.
Here's an example dataset:
| keyword id | website | est_traffic | date |
|------------|--------------|-------------|------------|
| 1 | google.com | 10 | 2023-12-22 |
| 1 | facebook.com | 20 | 2023-12-22 |
| 1 | youtube.com | 5 | 2023-12-22 |
| 1 | twitter.com | 40 | 2023-12-22 |
| 1 | linkedin.com | 50 | 2023-12-22 |
| 1 | google.com | 30 | 2023-12-23 |
| 1 | facebook.com | 20 | 2023-12-23 |
| 1 | youtube.com | 5 | 2023-12-23 |
| 1 | twitter.com | 10 | 2023-12-23 |
| 1 | linkedin.com | 15 | 2023-12-23 |
Based on this, the output for 'youtube.com' between dates 2023-12-22 and 2023-12-23 should look something like this:
[
{
"domain": "youtube.com",
"share_of_voice": [
{"date": "2023-12-22", "value": 4},
{"date": "2023-12-23", "value": 6.25}
]
},
// Entries for top 3 competitors
{
"domain": "google.com",
"share_of_voice": [
{"date": "2023-12-22", "value": 8},
{"date": "2023-12-23", "value": 37.5}
]
},
{
"domain": "twitter.com",
"share_of_voice": [
{"date": "2023-12-22", "value": 32},
{"date": "2023-12-23", "value": 12.5}
]
},
{
"domain": "linkedin.com",
"share_of_voice": [
{"date": "2023-12-22", "value": 40},
{"date": "2023-12-23", "value": 18.75}
]
},
]
If it is not possible to get output in format like above then return anything really it doesn't matter as long it has all data.
Thank you in advance for your assistance!
Some comments are in the statement.
WITH
toDate ('2023-12-22') as dt_start
, toDate ('2023-12-23') as dt_end
, 'youtube.com' as site2exclude
--, 'google.com' as site2exclude
SELECT *
FROM
(
SELECT
*
-- propagation of the end day enumeration to all other days
, sum (num_end) over (partition by website)
as num
FROM
(
SELECT
*
-- enumerating rows for the end day only, the excluded site gets the last number
, if
(
date = dt_end
, row_number () over (partition by date order by if (website = site2exclude, 0, share_of_voice) desc)
, 0
) as num_end
FROM
(
SELECT
*
-- share_of_voice computation for each day
, round (100 * est_traffic / sum (est_traffic) over (partition by date), 2)
as share_of_voice
FROM VALUES
(
'website String, est_traffic UInt32, date Date'
, ('linkedin.com', 50, toDate ('2023-12-22'))
, ('twitter.com' , 40, toDate ('2023-12-22'))
, ('facebook.com', 20, toDate ('2023-12-22'))
, ('google.com' , 10, toDate ('2023-12-22'))
, ('youtube.com' , 5, toDate ('2023-12-22'))
, ('google.com' , 30, toDate ('2023-12-23'))
, ('facebook.com', 20, toDate ('2023-12-23'))
, ('linkedin.com', 15, toDate ('2023-12-23'))
, ('twitter.com' , 10, toDate ('2023-12-23'))
, ('youtube.com' , 5, toDate ('2023-12-23'))
)
WHERE date between dt_start and dt_end
)
)
)
-- final selection of the excluded site and 3 other competitors
WHERE num between 1 and 3 or website = site2exclude
ORDER BY date, num
The result (with a couple of additional fields just for visualization of intermediate computations) is:
|website |est_traffic|date |share_of_voice|num_end|num|
|------------|-----------|----------|--------------|-------|---|
|google.com |10 |2023-12-22|8 |0 |1 |
|facebook.com|20 |2023-12-22|16 |0 |2 |
|linkedin.com|50 |2023-12-22|40 |0 |3 |
|youtube.com |5 |2023-12-22|4 |0 |5 |
|google.com |30 |2023-12-23|37.5 |1 |1 |
|facebook.com|20 |2023-12-23|25 |2 |2 |
|linkedin.com|15 |2023-12-23|18.75 |3 |3 |
|youtube.com |5 |2023-12-23|6.25 |5 |5 |