Tags: c#, http, caching, mono, screen-scraping

Is this a bad way to cache pages for my screen-scraper?


I wrote a simple screen-scraper to help me make vocabulary flash cards for my Greek class. It scrapes the words from an online dictionary, and outputs them in a format that my flash card manager can understand.

I don't want to bombard the dictionary with requests every time I run the scraper, so I cache each page to my hard drive the first time I load it (of course, this also makes the scraper much faster). I've never done any caching before, so I'm not sure what the best practices for this sort of thing are. Here is my solution:

using System;
using System.IO;
using System.Net;
using System.Web;

public class PerseusDocument
{
    readonly string url;

    public PerseusDocument (string url)
    {
        this.url = url;

        ... // (load the DOM with a third-party library)
    }

    static string cacheUrl;
    static string CacheUrl {
        get {
            if (cacheUrl == null) {
                cacheUrl = Path.Combine (Environment.GetFolderPath (Environment.SpecialFolder.LocalApplicationData), "perseus");
                Directory.CreateDirectory (cacheUrl);
            }

            return cacheUrl;
        }
    }

    string FullCacheUrl {
        get { return Path.Combine (CacheUrl, HttpUtility.UrlEncode (url)); }
    }

    bool IsCached {
        get { return File.Exists (FullCacheUrl); }
    }

    string Html {
        get {
            if (IsCached)
                return File.ReadAllText (FullCacheUrl);

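            // Not cached yet: download the page, disposing the client afterwards.
            string html;
            using (WebClient client = new WebClient ())
                html = client.DownloadString (url);

            // Save the page so that later runs read it from disk.
            File.WriteAllText (FullCacheUrl, html);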

            return html;
        }
    }
}

In other words, I simply check whether a file named after the (URL-encoded) url exists in the cache directory. If it does, I load it; if not, I download the page and save the HTML to a new file. Are there any glaring issues with doing things this way?


Solution

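  • You don't need to build your own cache. On Windows, the .NET Framework can route your HTTP requests through the WinINet cache. To turn the cache on, set the cache policy on your WebClient (RequestCachePolicy and RequestCacheLevel live in the System.Net.Cache namespace):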

    webClient.CachePolicy = new RequestCachePolicy(RequestCacheLevel.Default);
    

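    As long as the server marks its pages as cacheable (for example, through Cache-Control or Expires response headers), caching then happens automatically. Note that WinINet is a Windows component, so this may not take effect when running under Mono on other platforms.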
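
    For context, here is a minimal sketch of how that one-liner might fit into a complete program. The class name, the CreateCachingClient helper, and the example.com URL are placeholders of mine, not part of the original scraper. It assumes a runtime where the cache policy is actually honored.

```csharp
using System;
using System.Net;
using System.Net.Cache;

public static class CachingClientExample
{
    // Hypothetical helper: returns a WebClient whose requests use the
    // default HTTP cache policy instead of a hand-rolled file cache.
    public static WebClient CreateCachingClient ()
    {
        WebClient client = new WebClient ();
        client.CachePolicy = new RequestCachePolicy (RequestCacheLevel.Default);
        return client;
    }

    public static void Main (string[] args)
    {
        // Placeholder URL; pass the real dictionary page as the first argument.
        string url = args.Length > 0 ? args[0] : "http://example.com/";

        using (WebClient client = CreateCachingClient ()) {
            // Whether this is served from the cache depends on the runtime
            // and on the caching headers the server sends.
            string html = client.DownloadString (url);
            Console.WriteLine ("Downloaded {0} characters", html.Length);
        }
    }
}
```

    If the pages turn out not to be cacheable, or the runtime ignores the policy, every run will still hit the server, in which case a file cache like the one in the question remains a reasonable fallback.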