asp.net, zip, screen-scraping

Scrape current request and zip it up


I have an ASP.NET website with a few pages whose generated content I'd like to export and send to another service for archiving.

The best way I can think of to do this is to grab the response stream and dump it to a file, which is easy enough. My main challenge would be to follow the external resources and include them in the zip file. I would like to include stylesheets and images, as well as images referenced from within the stylesheets. I need the stream at request time because the generated output depends on things like the current session.

I'm also wondering whether all these locations should be normalized, in other words, whether the references should be rewritten to point to the same directory where the main document resides.

I can guarantee that all external resources will be located on the same server.

Is this something that can be done with the HtmlAgilityPack? It seems that I could do a lot of the manual work with this utility, but will I be able to use it to query images referenced in stylesheets?

Trying to do some discovery on this topic while completing some other tasks.

Thanks.


Solution

  • I checked my source in on GitHub, if you would like to see how I did this.

    My solution isn't perfect, but it works for what I need it to do. Some problems might arise in the normalization script. HtmlAgilityPack does not emit XHTML, just HTML, so I only used it to find the src and href attribute values that I wanted to replace, and then replaced those values in the original source with my normalized paths.

    I've also encountered a bug with the zip archiving, but I'm not sure yet what the cause is. If anyone has improvements they would like to add, let me know.

    Thanks
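
The normalization approach described in the answer (find the src and href values, then substitute flattened paths into the original source) can be sketched roughly as follows. This is not the code from the GitHub repository. It is a minimal illustration in Python using only the standard library, with `html.parser` standing in for HtmlAgilityPack and a regular expression handling `url(...)` references in stylesheets, which an HTML parser does not see. The names `RefCollector`, `flat_name`, `normalize_html`, `normalize_css`, and `build_archive` are hypothetical, and the sketch assumes every resource lives on the same server and that no two resources share a file name.

```python
import io
import posixpath
import re
import zipfile
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Matches url(...) references in CSS, with optional single or double quotes.
CSS_URL = re.compile(r"""url\(\s*(['"]?)(.*?)\1\s*\)""", re.IGNORECASE)


class RefCollector(HTMLParser):
    """Collects src values from any tag and href values from <link> tags."""

    def __init__(self):
        super().__init__()
        self.refs = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if not value:
                continue
            if name == "src" or (name == "href" and tag == "link"):
                self.refs.append(value)


def flat_name(ref):
    """Map a reference such as /img/a.png?v=2 to the bare file name a.png."""
    return posixpath.basename(urlsplit(ref).path)


def normalize_html(html):
    """Return (rewritten_html, {original_ref: flat_name})."""
    collector = RefCollector()
    collector.feed(html)
    mapping = {ref: flat_name(ref) for ref in collector.refs}
    for ref, name in mapping.items():
        # Plain string replacement on the original source, so the markup
        # is never re-serialized by the parser.
        for quote in ('"', "'"):
            html = html.replace(quote + ref + quote, quote + name + quote)
    return html, mapping


def normalize_css(css):
    """Rewrite url(...) references in a stylesheet to flat file names."""
    mapping = {}

    def rewrite(match):
        ref = match.group(2)
        if ref.startswith("data:"):
            return match.group(0)  # inline data URIs need no file
        mapping[ref] = flat_name(ref)
        return 'url("%s")' % mapping[ref]

    return CSS_URL.sub(rewrite, css), mapping


def build_archive(files):
    """Zip a {name: bytes} dict in memory and return the archive bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, data in files.items():
            archive.writestr(name, data)
    return buffer.getvalue()


# Example with made-up markup and paths.
page = '<link rel="stylesheet" href="/css/site.css"><img src="/img/logo.png?v=2">'
sheet = "body { background: url('/img/bg.png'); }"
new_page, page_refs = normalize_html(page)
new_sheet, sheet_refs = normalize_css(sheet)
archive_bytes = build_archive({
    "index.html": new_page.encode("utf-8"),
    "site.css": new_sheet.encode("utf-8"),
})
```

The two mappings list every original path that has to be read from the server and added to the archive under its flattened name; reading those bytes is left out here. Flattening by file name is the simplest form of normalization, but it would silently overwrite entries if two directories contained files with the same name, so a real implementation would probably need a collision check.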