Tags: c#, .net, web-scraping

Download only the first portion (of unknown length) of a web page using C#


I'm writing a personal app that scrapes data from a website. It currently downloads entire pages before analyzing them, and these pages range from 300 to 600 KiB each. The 10 pages that I tested against total 4 MiB. The pages contain dynamic content, so I don't know exactly where the data starts, but I do have delimiters that let me locate it once I've scanned the page. Is there any way to download only up to the portion that I need? That would cut the total download to 2 MiB for those 10 pages.


Solution

  • Here is a simple example that reads the response stream in 10-character blocks and stops as soon as a block equals your 10-character delimiter. The specifics are up to you to handle (for instance, a delimiter that straddles two blocks will not match as written), but I think this represents an easy way to achieve what you want.

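    StringBuilder sb = new StringBuilder();
    HttpWebRequest req = (HttpWebRequest)WebRequest.Create("http://example.com");
    using (var resp = req.GetResponse())
    {
        using (StreamReader sr = new StreamReader(resp.GetResponseStream()))
        {
            char[] block = new char[10];
            int read;
            // ReadBlock returns the number of characters read; 0 means end of stream.
            while ((read = sr.ReadBlock(block, 0, block.Length)) > 0)
            {
                // Stop downloading once a full block equals the delimiter (myDelim is a char[10]).
                if (read == block.Length && block.CharEquals(myDelim))
                    break;
                sb.Append(block, 0, read);
            }
        }
    }
    // Process the StringBuilder here.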
    

    Please note that CharEquals is an extension method that simply checks whether two character arrays are equal; there's nothing special about it.
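
The block-by-block comparison above only finds a delimiter that lines up with a block boundary. As a rough sketch of one way to handle a delimiter at any offset, the helper below reads larger blocks and searches the accumulated text, re-checking just enough of the previous tail to catch a delimiter split across two reads. `PartialReader`, `ReadUntil` and the `blockSize` parameter are hypothetical names introduced here for illustration; they are not part of the original answer or of .NET.

```csharp
using System;
using System.IO;
using System.Text;

public static class PartialReader
{
    // Reads from 'reader' until 'delimiter' has been seen or the input ends.
    // Returns everything read so far, including the delimiter when it was found.
    public static string ReadUntil(TextReader reader, string delimiter, int blockSize = 4096)
    {
        if (string.IsNullOrEmpty(delimiter))
            throw new ArgumentException("Delimiter must not be empty.", nameof(delimiter));

        var sb = new StringBuilder();
        var block = new char[blockSize];
        int read;
        while ((read = reader.Read(block, 0, block.Length)) > 0)
        {
            // Start the search far enough back to catch a delimiter
            // that began in the previous block and ends in this one.
            int searchFrom = Math.Max(0, sb.Length - (delimiter.Length - 1));
            sb.Append(block, 0, read);

            int hit = sb.ToString(searchFrom, sb.Length - searchFrom)
                        .IndexOf(delimiter, StringComparison.Ordinal);
            if (hit >= 0)
            {
                // Drop anything read past the delimiter and stop reading.
                sb.Length = searchFrom + hit + delimiter.Length;
                break;
            }
        }
        return sb.ToString();
    }
}
```

To use it with the request above, pass the `StreamReader` wrapped around `resp.GetResponseStream()` and let the `using` blocks dispose the response once `ReadUntil` returns. How many bytes are actually saved depends on how much the network stack has already buffered by the time the response is closed, so the saving is likely to be somewhat less than the ideal.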