Search code examples
scalabilityanalyticsmetrics

How do different visitor metrics relate?


Hypothetically, tets say someone tells you to to expect X (like 100,000 or something) number of unique visitors per day as a result of a successful marketing campaing.

How does that translate to peak requests/second? Peak simultaneous requests?

Obviously this depends on many factors, such as typical number of pages requested per user session or load time of a typical page, or many other things. These are other variables Y, Z, V, etc.

I'm looking for some sort of function or even just a ratio to estimate these metrics. Obviously for planing out the production environment scalability strategy.

This might happen on a production site I'm working on really soon. Any kind of help estimating these is useful.


Solution

  • Edit: (following indication that we have virtually NO prior statistics on the traffic)
    We can therefore forget about the bulk of the plan laid out below and directly get into the "run some estimates" part. The problem is that we'll need to fill-in parameters from the model using educated guesses (or plain wild guesses). The following is a simple model for which you can tweak the parameters based on your understanding of the situation.

    Model

    Assumptions:
    a) The distribution of page requests follows the normal distribution curve.
    b) Considering a short period during peak traffic, say 30 minutes, the number of requests can be considered to be evenly distributed.
    This could be [somewhat] incorrect: for example we could have a double curve if the ad campaign targets multiple geographic regions, say the US and the Asian markets. Also the curve could follow a different distribution. These assumptions are however good ones for the following reasons:

    • it would err, if at all, on the "pessimistic side" i.e. over-estimating peak traffic values. This "pessimistic" outlook can further be further adopted by using a slightly smaller std deviation value. (We suggest using 2 to 3 hours, which would put 68% and 95% of the traffic over a period of 4 and 8 hours (2h std dev) and 6 and 12 hours (3h stddev), respectively.
    • it makes for easy calculations ;-)
    • it is expected to generally match reality.

    Parameters:

    • V = expected number of distinct visitors per 24 hour period
    • Ppv = average number of page requests associated with a given visitor session. (you may consider using the formula twice, one for "static" type of responses, and the other for dynamic responses, i.e. when the application spends time crafting a response for a given user/context)
    • sig = std deviation in minutes
    • R = peak-time number of requests per minute.

    Formula:

       R = (V * Ppv * 0.0796)/(2 * sig / 10)
    

    That is because, with a normal distribution, and as per z-score table, roughly 3.98% of the samples fall within 1/10 of a std dev, on one or the other side of the mean (of the very peak), therefore get almost 8 percent of the samples within one tenth of a std dev on each side, and with the assumption of relatively even distribution during this period, we just divide by the number of minutes.

    Example: V=75,000 Ppv=12 and sig = 150 minutes (i.e 68% of traffic assumed to come over 5 hours, 95% over 10 hours, 5% for the other 14 hours of the day). R = 2,388 requests per minute, i.e. 40 requests per second. Rather Heavy, but "doable" (unless application takes 15 seconds per request...)

    Edit (Dec 2012)
    : I'm adding here an "executive summary" of the model proposed above, as provided in comments by full.stack.ex.
    In this model, we assume most people visit our system, say, at noon. That's the peak time. Others jump ahead or lag behind, the farther the fewer; Nobody at midnight. We chose a bell curve that reasonably covers all requests within the 24h period: with about 4 sigma on the left and 4 on the right, "nothing" significant remains in the long tail. To simulate the very peak, we cut out a narrow stripe around the noon and count the requests there.

    It is noteworthy that this model, in practice, tends to overestimate the peak traffic, and may prove to be more useful at estimating "worse case" scenario rather than more plausible traffic patterns. Tentative suggestions to improve the estimate are

    • to extend the sig parameter (to acknowledge that the effective traffic period of relatively high traffic is longer)
    • to reduce the overall amount of visits for the period considered i.e. reduce the V parameter, by say 20% (to acknowledge that about that many visit happen outside of any peak time)
    • to use a different distribution such as say the Poisson or some binomial distribution.
    • to consider that there are a number of peaks each day, and that the traffic curve is actually the sum of several normal (or other distribution) functions with similar spread, but with a distinct peak. Assuming that such peaks are sufficiently apart, we can use the original formula, only with a V factor divided by as many peaks as considered.

    [original response]
    It appears that your immediate concern is how the server(s) may handle the extra load... A very worthy concern ;-). Without distracting you from this operational concern, consider the process of estimating the scale of the upcoming surge, also provides an opportunity of preparing yourself to gather more and better intelligence about the site's traffic, during and beyond the ad-campaign. Such information will in time prove useful for making better estimates of surges etc, but also for guiding some of the site's design (for commercial efficiency as well as for improving scalability).

    A tentative plan

    Assume qualitative similarity with existing traffic.
    The ad campaign will expose the site to a distinct population (type of users) than its current visitors/users population: different situations select different subjects. For example the "ad campaign" visitors may be more impatient, focussed on a particular feature, concerned about price... as compared to the "self selected ?" visitors. Never the less, by lack of any other supporting model and measurement, and for sake of estimating load, the general principle could be to assume that the surge users will on-the-whole behave similarly to the self-selected crowd. A common approach is "run numbers" on this basis and to use educated guesses to slightly bend the coefficients of the model to accommodate for a few distinctive qualitative distinctions.

    Gather statistics about existing traffic
    Unless you readily have better information for this (eg. tealeaf, Google Analytics...) your source for such information may simply be the webserver's log... You can then build some simple tools to extract parse these logs and extract the following statistics. Note that these tools will be reusable for future analysis (eg: of the campaign itself), and also look for opportunities of logging more/different data, without significantly changing the application!

    • Average, Min, Max, Std Dev. for
      • number of pages visited per session
      • duration of a session
    • percentage of 24 hour traffic for each hour of a work day (exclude week-ends and such, unless of course this is a site which receives much traffic during these periods) These percentages should be calculated over several weeks at least to remove noise.

    "Run" some estimates:
    For example, start with peak use estimate, using the peak hour(s) percentage, the average daily session count, the average number of pages hits per session etc. This estimate should take into account the stochastic nature of traffic. Note that you don't have to, in this phase, worry about the impact of the queuing effect, instead, assume that the service time relative to the request period is low enough. Therefore just use a realistic estimate (or rather a value informed from the log analysis, for these very high usage periods), for the way the probability of a request is distributed over short periods (say of 15 minutes).

    Finally, based on the numbers you obtained in this fashion, you can get a feel for the type of substained load this would represent on the server, and plan to add resources, to refactor part of the application. Also -very important!- if the outlook for sustained at-capacity load, start running the Pollaczek-Khinchine formula, as suggested by ChrisW, to get a better estimate of the effective load.

    For extra credit ;-) Consider running some experiments during the campaign for example by randomly providing a distinct look or behavior for some of the pages visited, and by measuring the impact this may have (if any) on particular metrics (registration for more info, orders place, number of pages visited ...) The effort associated with this type of experiment may be significant, but the return can be significant as well, and if nothing else it may keep your "useability expert/consultant" on his/her toes ;-) You'll obviously want to work on defining such experiments, with the proper marketing/business authorities, and you may need to calculate ahead of time the minimum percentage of users upon which the alternate site would be proposed, to keep the experiment statistically representative. It is indeed important to know that the experiment doesn't need to be applied to 50% of the visitors; one can start small, just not so small that possible variations observed may be due to random...