c# asp.net screen-scraping

How to scrape the contents of an axd resource?


Essentially, I have an img tag with a src attribute of /ChartImg.axd?i=chart_0_0.png&g=06469eea67ea452b977f8e73cad70691. Do I need to create another WebRequest to get the content of this resource, or is there a simpler way?

I am scraping the output of the current request. Below is what I've got so far...

In some instances, additionalAssets will contain the relative URI of an .axd resource. I would like to include that content in the archive I am building.

    private void ProcessPrintRequest()
    {
        this.Response.Clear();
        this.Response.ContentType = "application/zip";
        this.Response.AddHeader("Content-Disposition", "attachment;filename=archive.zip");

        using (var stream = new ZipOutputStream(new ZeroByteStreamWrapper(this.Response.OutputStream)))
        {
            stream.SetLevel(9);

            var additionalAssets = new PathNormailzationDictionary();

            this.ExportDocument(stream, additionalAssets);
            this.ExportAdditionalAssets(stream, additionalAssets);
        }

        this.Response.End();
    }

    private void ExportAdditionalAssets(ZipOutputStream stream, PathNormailzationDictionary additionalAssets)
    {
        var buffer = new byte[32 * 1024];
        int read;

        // TODO: Request content of .axd resources
        foreach (var item in additionalAssets.Where(item => File.Exists(Server.MapPath(item.Key))))
        {
            var entry = new ZipEntry(item.Value);

            stream.PutNextEntry(entry);

            using (var fileStream = File.OpenRead(Server.MapPath(item.Key))) 
            {
                while ((read = fileStream.Read(buffer, 0, buffer.Length)) > 0)
                {
                    stream.Write(buffer, 0, read);
                }
            }
        }
    }

    private void ExportDocument(ZipOutputStream stream, PathNormailzationDictionary additionalAssets)
    {
        var entry = new ZipEntry("index.html");

        stream.PutNextEntry(entry);

        var document = this.GetNormalizedDocument(additionalAssets);

        var writer = new StreamWriter(stream);
        writer.Write(document);
        writer.Flush();
    }

    private string GetNormalizedDocument(PathNormailzationDictionary additionalAssets);

Solution

  • Yes, you have to create another WebRequest. Rendering any given HTML page involves multiple HTTP requests: one for the HTML page itself, then another for each external resource referenced by a src attribute. There is no getting away from it.

    -Oisin
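
To illustrate the answer, here is a minimal sketch of fetching an .axd resource with a second request and copying it into an output stream such as the ZipOutputStream from the question. The class name AxdFetcher, both method names, and the cookieHeader parameter are hypothetical and not part of the original post. Forwarding the Cookie header is an assumption: a new WebRequest does not carry the browser's cookies, so it may be needed if the handler depends on session or authentication state.

```csharp
using System;
using System.IO;
using System.Net;

// Hypothetical helper, not from the original post.
public static class AxdFetcher
{
    // Resolves a relative path such as "/ChartImg.axd?i=..." against the
    // URL of the page that is being exported.
    public static Uri ResolveAssetUri(Uri pageUri, string relativePath)
    {
        return new Uri(pageUri, relativePath);
    }

    // Issues a separate GET request for the resource and copies the
    // response body into the target stream (for example, a zip entry).
    // cookieHeader may be null; passing it along is an assumption that
    // the handler needs the caller's session or authentication cookies.
    public static void CopyResourceTo(Uri resourceUri, Stream target, string cookieHeader)
    {
        WebRequest request = WebRequest.Create(resourceUri);

        if (!string.IsNullOrEmpty(cookieHeader))
        {
            request.Headers[HttpRequestHeader.Cookie] = cookieHeader;
        }

        var buffer = new byte[32 * 1024];
        int read;

        using (WebResponse response = request.GetResponse())
        using (Stream responseStream = response.GetResponseStream())
        {
            while ((read = responseStream.Read(buffer, 0, buffer.Length)) > 0)
            {
                target.Write(buffer, 0, read);
            }
        }
    }
}
```

In ExportAdditionalAssets, the TODO could then be handled by a second loop over the entries whose key points at an .axd handler: call stream.PutNextEntry(new ZipEntry(item.Value)) and then AxdFetcher.CopyResourceTo(AxdFetcher.ResolveAssetUri(this.Request.Url, item.Key), stream, this.Request.Headers["Cookie"]). The existing File.Exists filter skips those entries because they do not exist on disk.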