
Do you need background workers or multiple threads to fire multiple Async HttpWebRequests?


Overall goal

I'm trying to call the Google PageSpeed Insights API with multiple input URLs read from a .txt file and output the results to a .csv.

What I tried

I wrote a console app that fires off these requests, adds each response to a list as it comes back, and writes the list to the .csv file once all of them are done (async got a little nutty when I tried writing each response to the .csv immediately).

My code is below, and far from optimized. I come from a JavaScript background, where I usually don't use web workers or any other managed new threads, so I was trying to do the same in C#.

  1. Can I run these multiple WebRequests and write them to a collection (or output file) without using multiple threads and have them all run in parallel, not having to wait for each request to come back before handling the next one?
  2. Is there a cleaner way to do this with callbacks?
  3. If threads or BackgroundWorkers are needed, what's a Clean Code way of doing this?

Initial Example Code

static void Main(string[] args)
{
    Console.WriteLine("Begin Google PageSpeed Insights!");

    appMode = ConfigurationManager.AppSettings["ApplicationMode"];
    var inputFilePath = READ_WRITE_PATH + ConfigurationManager.AppSettings["InputFile"];
    var outputFilePath = READ_WRITE_PATH + ConfigurationManager.AppSettings["OutputFile"];

    var inputLines = File.ReadAllLines(inputFilePath).ToList();

    if (File.Exists(outputFilePath))
    {
        File.Delete(outputFilePath);
    }

    List<string> outputCache = new List<string>();

    foreach (var line in inputLines)
    {
        var requestDataFromPsi = CallPsiForPrimaryStats(line);
        Console.WriteLine($"Got response of {requestDataFromPsi.Result}");

        outputCache.Add(requestDataFromPsi.Result);
    }

    var writeTask = WriteCharacters(outputCache, outputFilePath);

    writeTask.Wait();

    Console.WriteLine("End Google PageSpeed Insights");
}

private static async Task<string> CallPsiForPrimaryStats(string url)
{
    HttpWebRequest myReq = (HttpWebRequest)WebRequest.Create($"https://www.googleapis.com/pagespeedonline/v2/runPagespeed?url={url}&strategy=mobile&key={API_KEY}");
    myReq.Method = WebRequestMethods.Http.Get;
    myReq.Timeout = 60000;
    myReq.Proxy = null;
    myReq.ContentType = "application/json";

    Task<WebResponse> task = Task.Factory.FromAsync(
            myReq.BeginGetResponse,
            asyncResult => myReq.EndGetResponse(asyncResult),
            (object)null);

    return await task.ContinueWith(t => ReadStreamFromResponse(t.Result));
}

private static string ReadStreamFromResponse(WebResponse response)
{
    using (Stream responseStream = response.GetResponseStream())
    using (StreamReader sr = new StreamReader(responseStream))
    {
        string jsonResponse = sr.ReadToEnd();
        dynamic jsonObj = Newtonsoft.Json.JsonConvert.DeserializeObject(jsonResponse);

        var psiResp = new PsiResponse()
        {
            Url = jsonObj.id,
            SpeedScore = jsonObj.ruleGroups.SPEED.score,
            UsabilityScore = jsonObj.ruleGroups.USABILITY.score,
            NumberResources = jsonObj.pageStats.numberResources,
            NumberHosts = jsonObj.pageStats.numberHosts,
            TotalRequestBytes = jsonObj.pageStats.totalRequestBytes,
            NumberStaticResources = jsonObj.pageStats.numberStaticResources,
            HtmlResponseBytes = jsonObj.pageStats.htmlResponseBytes,
            CssResponseBytes = jsonObj.pageStats.cssResponseBytes,
            ImageResponseBytes = jsonObj.pageStats.imageResponseBytes,
            JavascriptResponseBytes = jsonObj.pageStats.javascriptResponseBytes,
            OtherResponseBytes = jsonObj.pageStats.otherResponseBytes,
            NumberJsResources = jsonObj.pageStats.numberJsResources,
            NumberCssResources = jsonObj.pageStats.numberCssResources,

        };
        return CreateOutputString(psiResp);
    }
}

static async Task WriteCharacters(List<string> inputs, string outputFilePath)
{
    using (StreamWriter fileWriter = new StreamWriter(outputFilePath))
    {
        await fileWriter.WriteLineAsync(TABLE_HEADER);

        foreach (var input in inputs)
        {
            await fileWriter.WriteLineAsync(input);
        }
    }
}

private static string CreateOutputString(PsiResponse psiResponse)
{
    var stringToWrite = "";

    foreach (var prop in psiResponse.GetType().GetProperties())
    {
        stringToWrite += $"{prop.GetValue(psiResponse, null)},";
    }
    Console.WriteLine(stringToWrite);
    return stringToWrite;
}

Update: After Refactor from Stephen Cleary Tips

The problem is that this still runs slowly. The original took 20 minutes, and the refactored version still takes 20 minutes. It seems to be throttled somewhere, maybe by Google on the PageSpeed API. I tested it by calling https://www.google.com/, https://www.yahoo.com/, https://www.bing.com/ and 18 others, and that runs slowly as well, with a bottleneck of only about 5 requests being processed at a time. I tried refactoring to run 5 test URLs, write to the file, and repeat, but that only marginally sped up the process.

static void Main(string[] args) { MainAsync(args).Wait(); }
static async Task MainAsync(string[] args)
{
    Console.WriteLine("Begin Google PageSpeed Insights!");

    appMode = ConfigurationManager.AppSettings["ApplicationMode"];
    var inputFilePath = READ_WRITE_PATH + ConfigurationManager.AppSettings["InputFile"];
    var outputFilePath = READ_WRITE_PATH + ConfigurationManager.AppSettings["OutputFile"];

    var inputLines = File.ReadAllLines(inputFilePath).ToList();

    if (File.Exists(outputFilePath))
    {
        File.Delete(outputFilePath);
    }

    var tasks = inputLines.Select(line => CallPsiForPrimaryStats(line));
    var outputCache = await Task.WhenAll(tasks);

    await WriteCharacters(outputCache, outputFilePath);

    Console.WriteLine("End Google PageSpeed Insights");
}

private static async Task<string> CallPsiForPrimaryStats(string url)
{
    HttpWebRequest myReq = (HttpWebRequest)WebRequest.Create($"https://www.googleapis.com/pagespeedonline/v2/runPagespeed?url={url}&strategy=mobile&key={API_KEY}");
    myReq.Method = WebRequestMethods.Http.Get;
    myReq.Timeout = 60000;
    myReq.Proxy = null;
    myReq.ContentType = "application/json";
    Console.WriteLine($"Start call: {url}");

    // Try using `HttpClient()` later
    //var myReq2 = new HttpClient();
    //await myReq2.GetAsync($"https://www.googleapis.com/pagespeedonline/v2/runPagespeed?url={url}&strategy=mobile&key={API_KEY}");

    Task<WebResponse> task = Task.Factory.FromAsync(
        myReq.BeginGetResponse,
        myReq.EndGetResponse,
        (object)null);
    var result = await task;
    return ReadStreamFromResponse(result);
}

private static string ReadStreamFromResponse(WebResponse response)
{
    using (Stream responseStream = response.GetResponseStream())
    using (StreamReader sr = new StreamReader(responseStream))
    {
        string jsonResponse = sr.ReadToEnd();
        dynamic jsonObj = Newtonsoft.Json.JsonConvert.DeserializeObject(jsonResponse);

        var psiResp = new PsiResponse()
        {
            Url = jsonObj.id,
            SpeedScore = jsonObj.ruleGroups.SPEED.score,
            UsabilityScore = jsonObj.ruleGroups.USABILITY.score,
            NumberResources = jsonObj.pageStats.numberResources,
            NumberHosts = jsonObj.pageStats.numberHosts,
            TotalRequestBytes = jsonObj.pageStats.totalRequestBytes,
            NumberStaticResources = jsonObj.pageStats.numberStaticResources,
            HtmlResponseBytes = jsonObj.pageStats.htmlResponseBytes,
            CssResponseBytes = jsonObj.pageStats.cssResponseBytes,
            ImageResponseBytes = jsonObj.pageStats.imageResponseBytes,
            JavascriptResponseBytes = jsonObj.pageStats.javascriptResponseBytes,
            OtherResponseBytes = jsonObj.pageStats.otherResponseBytes,
            NumberJsResources = jsonObj.pageStats.numberJsResources,
            NumberCssResources = jsonObj.pageStats.numberCssResources,

        };
        return CreateOutputString(psiResp);
    }
}

static async Task WriteCharacters(IEnumerable<string> inputs, string outputFilePath)
{
    using (StreamWriter fileWriter = new StreamWriter(outputFilePath))
    {
        await fileWriter.WriteLineAsync(TABLE_HEADER);

        foreach (var input in inputs)
        {
            await fileWriter.WriteLineAsync(input);
        }
    }
}

private static string CreateOutputString(PsiResponse psiResponse)
{
    var stringToWrite = "";
    foreach (var prop in psiResponse.GetType().GetProperties())
    {
        stringToWrite += $"{prop.GetValue(psiResponse, null)},";
    }
    Console.WriteLine(stringToWrite);
    return stringToWrite;
}

Solution

  • Can I run these multiple WebRequests and write them to a collection (or output file) without using multiple threads and have them all run in parallel, not having to wait for each request to come back before handling the next one?

    Yes; what you're looking for is asynchronous concurrency, which you get by awaiting Task.WhenAll.

  • Is there a cleaner way to do this with callbacks?

    async/await is cleaner than callbacks. JavaScript has moved from callbacks, to promises (similar to Task<T> in C#), to async/await (very similar to async/await in C#). The cleanest solution in both languages is now async/await.

    There are a few gotchas in C#, though, largely due to backwards compatibility.

    1) In asynchronous Console apps, you do need to block the Main method. This is, generally speaking, the only time you should block on asynchronous code:

    static void Main(string[] args) { MainAsync(args).Wait(); }
    static async Task MainAsync(string[] args)
    {
    

    Once you have an async MainAsync method, you can use Task.WhenAll for asynchronous concurrency:

      ...
      var tasks = inputLines.Select(line => CallPsiForPrimaryStats(line));
      var outputCache = await Task.WhenAll(tasks);
      await WriteCharacters(outputCache, outputFilePath);
      ...
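
To see the Task.WhenAll pattern in isolation, here is a minimal sketch that swaps the real HTTP call for a simulated one. FakeCallAsync and its delays are hypothetical stand-ins for CallPsiForPrimaryStats, not part of the original code. All three operations are in flight at once, and the results come back in input order even though the first one finishes last.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;

public static class WhenAllSketch
{
    // Hypothetical stand-in for CallPsiForPrimaryStats: waits without blocking a thread.
    public static async Task<string> FakeCallAsync(string url, int delayMs)
    {
        await Task.Delay(delayMs);
        return $"done:{url}";
    }

    public static async Task<string[]> RunAllAsync(string[] urls)
    {
        // ToList() forces the lazy Select to run now, so every call is started
        // exactly once before any of them is awaited. Delays shrink with the index,
        // so the first URL is the slowest.
        var tasks = urls.Select((url, i) => FakeCallAsync(url, 300 - 100 * i)).ToList();

        // WhenAll completes when every task has completed; the result array
        // follows the order of the tasks, not the order of completion.
        return await Task.WhenAll(tasks);
    }

    public static void Main()
    {
        var stopwatch = Stopwatch.StartNew();
        var results = RunAllAsync(new[] { "a", "b", "c" }).GetAwaiter().GetResult();

        Console.WriteLine(string.Join(",", results)); // prints "done:a,done:b,done:c"
        Console.WriteLine($"elapsed ms: {stopwatch.ElapsedMilliseconds}");
    }
}
```

The elapsed time should be close to the longest single delay rather than the sum of all three, which is the behaviour the question is after.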
    

    2) You shouldn't use ContinueWith; it's a low-level, dangerous API. Use await instead:

    private static async Task<string> CallPsiForPrimaryStats(string url)
    {
      ...
      Task<WebResponse> task = Task.Factory.FromAsync(
          myReq.BeginGetResponse,
          myReq.EndGetResponse,
          (object)null);
      var result = await task;
      return ReadStreamFromResponse(result);
    }
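
One concrete difference between the two styles is how failures surface. The sketch below is an illustration under assumptions: FailAsync is a hypothetical operation that always throws, standing in for a request that fails. With await, the caller catches the original exception; with ContinueWith plus t.Result, the caller catches an AggregateException wrapping it.

```csharp
using System;
using System.Threading.Tasks;

public static class AwaitVersusContinueWith
{
    // Hypothetical failing operation, standing in for a request that errors out.
    private static async Task<string> FailAsync()
    {
        await Task.Yield();
        throw new InvalidOperationException("request failed");
    }

    // await rethrows the original exception from the faulted task.
    public static async Task<string> ViaAwait()
    {
        try
        {
            var result = await FailAsync();
            return result.ToUpper();
        }
        catch (Exception ex)
        {
            return ex.GetType().Name;
        }
    }

    // t.Result on a faulted task throws AggregateException inside the continuation,
    // so that wrapper is what the caller ends up catching.
    public static async Task<string> ViaContinueWith()
    {
        try
        {
            return await FailAsync().ContinueWith(t => t.Result.ToUpper());
        }
        catch (Exception ex)
        {
            return ex.GetType().Name;
        }
    }

    public static void Main()
    {
        Console.WriteLine(ViaAwait().GetAwaiter().GetResult());        // prints "InvalidOperationException"
        Console.WriteLine(ViaContinueWith().GetAwaiter().GetResult()); // prints "AggregateException"
    }
}
```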
    

    3) There are often more "async-friendly" types available. In this case, consider using HttpClient instead of HttpWebRequest; you'll find that your code cleans up quite a bit.
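
As a rough idea of what the HttpClient version might look like, here is a sketch rather than a drop-in replacement. The class name, BuildRequestUri, GetPsiJsonAsync, and GetAllAsync are hypothetical names introduced for illustration, and the API key is passed in as a parameter instead of read from the API_KEY constant. Escaping the query values with Uri.EscapeDataString is an addition that the original code does not do. Parsing the JSON into PsiResponse is left out; the string returned here could be fed to the existing deserialization code.

```csharp
using System;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;

public static class PsiHttpClientSketch
{
    // One HttpClient shared by all requests; the 60-second timeout mirrors
    // the Timeout = 60000 setting in the original HttpWebRequest code.
    private static readonly HttpClient Client =
        new HttpClient { Timeout = TimeSpan.FromSeconds(60) };

    // Builds the v2 runPagespeed URL used in the question.
    // Assumption: query values are escaped, which the original code does not do.
    public static string BuildRequestUri(string url, string apiKey)
    {
        return "https://www.googleapis.com/pagespeedonline/v2/runPagespeed"
             + $"?url={Uri.EscapeDataString(url)}&strategy=mobile&key={Uri.EscapeDataString(apiKey)}";
    }

    // Fetches the raw JSON body for one URL without blocking a thread.
    public static async Task<string> GetPsiJsonAsync(string url, string apiKey)
    {
        using (var response = await Client.GetAsync(BuildRequestUri(url, apiKey)))
        {
            response.EnsureSuccessStatusCode(); // throws HttpRequestException on non-2xx
            return await response.Content.ReadAsStringAsync();
        }
    }

    // Starts every request, then waits for all of them together.
    public static async Task<string[]> GetAllAsync(string[] urls, string apiKey)
    {
        var tasks = urls.Select(u => GetPsiJsonAsync(u, apiKey)).ToList();
        return await Task.WhenAll(tasks);
    }

    public static void Main()
    {
        // No network call here; this only shows the URL that would be requested.
        Console.WriteLine(BuildRequestUri("https://www.google.com/", "YOUR_KEY"));
    }
}
```

Compared with the HttpWebRequest version, there is no Task.Factory.FromAsync wrapper and no manual stream reading, since GetAsync and ReadAsStringAsync already return tasks.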