c# asp.net html-agility-pack streamwriter

Writing the contents of a scraped page to a text file to download on client's browser


I am trying to write the contents of a scraped web page to a downloadable .txt file from an ASP.NET web page. I can currently print the contents to a label on the page, but I cannot figure out how to write each value on its own line in a .txt file and send it straight to the client's browser as a download. My current code for printing to the label is:

//Read HTML of Webpage inserted into urlTextbox
HtmlWeb hw = new HtmlWeb();
HtmlDocument doc = hw.Load(urlTextbox.Text);

//Selecting body text
var bodySec = doc.DocumentNode.SelectNodes("//body[@class]");

foreach (var node in bodySec)
{
    //Selecting ONLY links from body section
    var linkSec = doc.DocumentNode.SelectNodes(".//a[@href]");
    foreach (HtmlNode node2 in linkSec)
    {
        string attributeValue = node2.GetAttributeValue("href", "");
        var baseUrl = new Uri("url.com");
        var url = new Uri(baseUrl, attributeValue);

        string links = url.AbsoluteUri;
        scriptLbl.Text += links;
        var linkLines = Regex.Split(links, @"\-\-\-");

        //Printing Links line by line
        foreach (string link in linkLines)
        {
            var prt1 = link + "<br>";
            scriptLbl.Text += prt1;
        }
    }
}

This scrapes the page and prints the links in the desired format. Ideally, I would like to write them to a file in the same format and have it downloaded on the same button click. I tried StreamWriter, but the file only ever contained the first line of the scraped content. My attempt with StreamWriter:

Response.ContentType = "text/plain";
Response.AddHeader("content-disposition", "attachment;filename=Urllist.txt");
Response.Clear();
using (StreamWriter writer = new StreamWriter(Response.OutputStream, Encoding.UTF8))
{
    writer.Write(links);
}

Response.End();

Any help would be greatly appreciated. I have tried answers to similar questions, but none of them gave me the full list of links from the string.


Solution

  • I solved this issue by creating a list of the items read from the label and iterating through them individually.

    string conv = label.Text;
    var result = conv.Split(' ');
    using(StreamWriter sw = new StreamWriter(Response.OutputStream, Encoding.UTF8))
    {
        foreach(var s in result.Distinct()) 
        {
            //using distinct to ensure no repeated items (scraping multiple pages w/ same links possible)
            sw.WriteLine(s);
        }
    }
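
For anyone adapting this, here is a minimal, self-contained sketch of the same idea: split the label text into individual links, drop duplicates, and write one link per line. The class name LinkExport and the method WriteLinks are hypothetical names made up for illustration, and the sketch assumes the label text separates links with "<br>" markers or whitespace, which may not match your page exactly.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

// Hypothetical helper, not part of ASP.NET or HtmlAgilityPack.
public static class LinkExport
{
    // Splits the label text into links, removes duplicates,
    // and writes each remaining link on its own line.
    public static void WriteLinks(string labelText, Stream output)
    {
        // Assumption: links are separated by "<br>" markers or whitespace.
        string[] separators = { "<br>", " ", "\r", "\n" };

        IEnumerable<string> links = labelText
            .Split(separators, StringSplitOptions.RemoveEmptyEntries)
            .Distinct();

        // UTF-8 without a byte order mark; leaveOpen keeps the stream
        // usable by the caller after the writer is disposed.
        using (var writer = new StreamWriter(output, new UTF8Encoding(false), 1024, leaveOpen: true))
        {
            foreach (string link in links)
            {
                writer.WriteLine(link);
            }
        }
    }
}
```

In the button click handler this would presumably be called as LinkExport.WriteLinks(scriptLbl.Text, Response.OutputStream), after setting the content type and content-disposition header and before Response.End(), as in the question. Building the list of links directly inside the scraping loop, rather than reading it back from the label, would avoid depending on the label's markup at all.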