The question I'm struggling with is how to determine, in C++, which server has the fastest connection for the client to do a git clone from, or to download a tarball from. So basically I want to choose, from a collection of known mirrors, which one will be used for downloading content. The following code demonstrates what I am trying to achieve, perhaps more clearly, but I believe it's not something one should use in production :).
So let's say I have two known source mirrors, git-1.example.com and git-2.example.com, and I want to download tag-x.tar.gz from whichever one the client has the best connectivity to.
CDN.h
#ifndef CDN_H
#define CDN_H
#include <iostream>
#include <cstdio>
#include <cstring>
#include <cstdlib>
#include <netdb.h>
#include <arpa/inet.h>
#include <sys/time.h>
using namespace std;
class CDN {
public:
    long int dl_time;
    string host;
    string proto;
    string path;
    string dl_speed;
    double kbs;
    double mbs;
    double sec;
    long int ms;

    CDN(string, string, string);
    void get_download_speed();
    bool operator < (const CDN&);
};
#endif
CDN.cpp
#include "CDN.h"
CDN::CDN(string protocol, string hostname, string downloadpath)
{
    proto = protocol;
    host = hostname;
    path = downloadpath;
    dl_time = ms = 0;
    sec = mbs = kbs = 0;
    get_download_speed();
}
void CDN::get_download_speed()
{
    struct timeval dl_started;
    gettimeofday(&dl_started, NULL);
    long int download_start = ((unsigned long long) dl_started.tv_sec * 1000000) + dl_started.tv_usec;

    char buffer[256];
    char cmd_output[32];
    sprintf(buffer,"wget -O /dev/null --tries=1 --timeout=2 --no-dns-cache --no-cache %s://%s/%s 2>&1 | grep -o --color=never \"[0-9.]\\+ [KM]*B/s\"",proto.c_str(),host.c_str(),path.c_str());
    fflush(stdout);
    FILE *p = popen(buffer,"r");
    fgets(cmd_output, sizeof(cmd_output), p);  // was sizeof(buffer), which overflows cmd_output
    cmd_output[strcspn(cmd_output, "\n")] = 0;
    pclose(p);
    dl_speed = string(cmd_output);

    struct timeval download_ended;
    gettimeofday(&download_ended, NULL);
    long int download_end = ((unsigned long long)download_ended.tv_sec * 1000000) + download_ended.tv_usec;

    size_t output_type_k = dl_speed.find("KB/s");
    size_t output_type_m = dl_speed.find("MB/s");
    if(output_type_k != string::npos) {
        string dl_bytes = dl_speed.substr(0,output_type_k-1);
        kbs = atof(dl_bytes.c_str());
        mbs = kbs / 1000;
    } else if(output_type_m != string::npos) {
        string dl_bytes = dl_speed.substr(0,output_type_m-1);
        mbs = atof(dl_bytes.c_str());
        kbs = mbs * 1000;
    } else {
        cout << "Should catch the errors..." << endl;
    }

    dl_time = ms = download_end - download_start;  // elapsed wall time, in microseconds
    sec = ms / 1000000.0;                          // gettimeofday() deltas are microseconds, not CLOCKS_PER_SEC ticks
}
bool CDN::operator < (const CDN& other)
{
    return dl_time < other.dl_time;
}
main.cpp
#include "CDN.h"
int main()
{
    cout << "Checking CDN's" << endl;
    char msg[256];  // 128 could overflow once both hostnames and speed strings are included
    CDN cdn_1 = CDN("http","git-1.example.com","test.txt");
    CDN cdn_2 = CDN("http","git-2.example.com","test.txt");
    if(cdn_1 < cdn_2)  // only operator< is defined, so compare with it directly
    {
        sprintf(msg,"Downloading tag-x.tar.gz from %s %s since it's faster than %s %s",
                cdn_1.host.c_str(),cdn_1.dl_speed.c_str(),cdn_2.host.c_str(),cdn_2.dl_speed.c_str());
        cout << msg << endl;
    }
    else
    {
        sprintf(msg,"Downloading tag-x.tar.gz from %s %s since it's faster than %s %s",
                cdn_2.host.c_str(),cdn_2.dl_speed.c_str(),cdn_1.host.c_str(),cdn_1.dl_speed.c_str());
        cout << msg << endl;
    }
    return 0;
}
So, what are your thoughts, and how would you approach this? What are the alternatives to wget that achieve the same thing cleanly in C++?
EDIT: As @molbdnilo correctly pointed out, ping measures latency, but I'm interested in throughput. I have therefore edited the demonstration code to reflect that; the question, however, remains the same.
For starters, trying to determine "fastest CDN mirror" is an inexact science. There is no universally accepted definition of what "fastest" means. The most one can hope for, here, is to choose a reasonable heuristic for what "fastest" means, and then measure this heuristic as precisely as can be under the circumstances.
In the code example here, the chosen heuristic seems to be how long it takes to download a sample file from each mirror via HTTP.
That's not such a bad choice to make, actually. You could reasonably make an argument that some other heuristic might be slightly better, but the basic test of how long it takes to transfer a sample file, from each candidate mirror, I would think is a very reasonable heuristic.
The big, big problem I see here is the actual implementation of this heuristic. The way this attempt to time the sample download is made does not appear to be very reliable, and it will end up measuring a whole bunch of unrelated factors that have nothing to do with network bandwidth.
I see several opportunities here where external factors completely unrelated to network throughput will muck up the measured timings, and make them less reliable than they should be.
So, let's take a look at the code, and see how it attempts to measure the download time. Here's the meat of it:
sprintf(buffer,"wget -O /dev/null --tries=1 --timeout=2 --no-dns-cache --no-cache %s://%s/%s 2>&1 | grep -o --color=never \"[0-9.]\\+ [KM]*B/s\"",proto.c_str(),host.c_str(),path.c_str());
fflush(stdout);
FILE *p = popen(buffer,"r");
fgets(cmd_output, sizeof(cmd_output), p);
cmd_output[strcspn(cmd_output, "\n")] = 0;
pclose(p);
... and gettimeofday() gets used to sample the system clock before and after, to figure out how long this took. Ok, that's great. But what would this actually measure?
It helps a lot here to take a blank piece of paper, and just write down everything that happens as part of the popen() call, step by step:
1) A new child process is fork()ed. The operating system kernel creates a new child process.
2) The new child process exec()s /bin/bash, or your default system shell, passing in a long string that starts with "wget", followed by a bunch of other parameters that you see above.
3) The operating system kernel loads "/bin/bash" as the new child process. The kernel loads and opens any and all shared libraries that the system shell normally needs to run.
4) The system shell process initializes. It most likely reads the $HOME/.bashrc file and executes it, together with any standard shell initialization files and scripts that your system shell normally runs. That itself can create a bunch of new processes that have to be initialized and executed, before the new system shell process actually gets around to...
5) ...parsing the "wget" command it originally received as an argument, and exec()uting it.
6) The operating system kernel now loads "wget" as the new child process. The kernel loads and opens any and all shared libraries that the wget process needs. Looking at my Linux box, "wget" loads no fewer than 25 separate shared libraries, including Kerberos and SSL libraries. Each one of those shared libraries gets initialized.
7) The wget command performs a DNS lookup on the host, to obtain the IP address of the web server to connect to. If the local DNS server doesn't have the CDN mirror's hostname's IP address cached, it often takes several seconds to look up the CDN mirror's DNS zone's authoritative DNS servers, then query them for the IP address, hopping this way and that way across the intertubes.
Now, one moment... I seem to have forgotten what we were trying to do here... Oh, I remember: determine which CDN mirror is "fastest", by downloading a sample file from each mirror, right? Yeah, that must be it!
Now, what does all of the work done above have to do with determining which content mirror is the fastest?
Err... not much, from the way it looks to me. Now, none of the above should really be shocking news. After all, all of that is described in popen()'s manual page. If you read it, it tells you that this is what it does: start a new child process, then execute the system shell in order to execute the requested command. Etc., etc., etc...
Now, we're not talking about measuring time intervals that last many seconds, or minutes. If we were trying to measure something that takes a long time to execute, the relative overhead of popen()'s approach would be negligible, and not much to worry about. But the expected time to download the sample file, for the purpose of figuring out how fast each content mirror is -- I would expect the actual download time to be relatively short. And it seems to me that the overhead of doing it this way -- forking an entirely new process, and executing first the system shell, then the wget command with its massive list of dependencies -- is going to be statistically significant.
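The overhead claim is easy to check empirically: time a popen() of a command that does nothing at all, so that everything measured is fork/shell/exec startup cost and none of it is network activity (a sketch; the helper name is mine, and the numbers will vary by system):

```cpp
#include <cstdio>
#include <sys/time.h>

// Time popen() + pclose() of a no-op command. The entire interval is
// process-creation and shell-startup overhead -- exactly the cost that
// gets silently folded into the question's download measurement.
static long popen_overhead_usec()
{
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    FILE *p = popen("true", "r");
    if (p)
        pclose(p);
    gettimeofday(&t1, NULL);
    return (t1.tv_sec - t0.tv_sec) * 1000000L + (t1.tv_usec - t0.tv_usec);
}
```

On a typical Linux box this comes out to somewhere in the millisecond range, which is already a noticeable bias when the sample transfer itself may only take a fraction of a second.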
And as I mentioned in the beginning, given that this is trying to determine the vaguely nebulous concept of "fastest mirror", which is already an inexact science, it seems to me that you'd really want to get rid of as much utterly irrelevant overhead as possible, in order to get as accurate a result as you can.
So, it seems to me that you don't really want to measure here anything other than what you're trying to measure: network bandwidth. And you certainly don't want to measure any of what transpires before any network activity takes place.
I still think that trying to time a sample download is a reasonable proposition. What's not reasonable here is all the popen and wget bloat. So, forget all of that. Throw it out the window. You want to measure how long it takes to download a sample file over HTTP, from each candidate mirror? Well, why don't you do just that?
1) Create a new socket().
2) Use getaddrinfo() to perform a DNS lookup, and obtain the candidate mirror's IP address.
3) connect() to the mirror's HTTP port.
4) Format the appropriate HTTP GET request, and send it to the server.
The above does pretty much what the popen/wget combination does, up to this point.
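Step 4 amounts to writing a few header lines down the socket. A minimal request builder could look like this (a sketch; the function name is mine, and the host/path values below are the question's own placeholders):

```cpp
#include <string>

// Build a minimal HTTP GET request for /<path> on <host>.
// "Connection: close" tells the server to end the connection after the
// file is sent, so reading until EOF yields the whole sample file.
static std::string make_get_request(const std::string &host,
                                    const std::string &path)
{
    return "GET /" + path + " HTTP/1.1\r\n"
           "Host: " + host + "\r\n"
           "Connection: close\r\n"
           "\r\n";
}
```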
And only now would I start the clock running, by grabbing the current gettimeofday(), then wait until I've read the entire sample file from the socket, then grab the current gettimeofday() once more to get the ending time of the transmission, and then calculate the actual time it took to receive the file from the mirror.
Only then will I have some reasonable confidence that I'm actually measuring the time it takes to receive a sample file from a CDN mirror, completely ignoring the time it takes to execute a bunch of unrelated processes; and then, by taking the same sample from multiple CDN mirrors, have any hope of picking one using as sensible a heuristic as possible.
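Putting the whole recipe together, here is a sketch of that measurement (error handling abbreviated; the hostnames used with it would be the question's example mirrors, and a production version would also want to account for the HTTP response headers included in the timed read):

```cpp
#include <cstring>
#include <string>
#include <unistd.h>
#include <netdb.h>
#include <sys/socket.h>
#include <sys/time.h>

// Time how long it takes to receive one sample file from one mirror.
// Returns elapsed microseconds, or -1 on any failure.
static long long time_sample_download(const std::string &host,
                                      const std::string &path)
{
    // Steps 1+2: resolve the mirror's address. Note that the DNS lookup
    // happens *before* the clock starts, so it cannot skew the result.
    struct addrinfo hints = {}, *res = NULL;
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;
    if (getaddrinfo(host.c_str(), "80", &hints, &res) != 0)
        return -1;

    // Step 3: create the socket and connect to the HTTP port.
    int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (fd < 0 || connect(fd, res->ai_addr, res->ai_addrlen) != 0) {
        if (fd >= 0)
            close(fd);
        freeaddrinfo(res);
        return -1;
    }
    freeaddrinfo(res);

    // Step 4: send a minimal GET request.
    std::string req = "GET /" + path + " HTTP/1.1\r\n"
                      "Host: " + host + "\r\n"
                      "Connection: close\r\n\r\n";
    if (write(fd, req.c_str(), req.size()) != (ssize_t)req.size()) {
        close(fd);
        return -1;
    }

    // Only now start the clock, then read until the server closes the
    // connection. This interval covers the transfer and nothing else.
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    char buf[4096];
    while (read(fd, buf, sizeof buf) > 0)
        ;
    gettimeofday(&t1, NULL);
    close(fd);

    return ((long long)t1.tv_sec - t0.tv_sec) * 1000000LL
         + (t1.tv_usec - t0.tv_usec);
}
```

Calling time_sample_download() once per candidate mirror and picking the smallest result replaces the entire popen/wget dance with a measurement that starts and stops exactly where the transfer does.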