
How to get webpage title without downloading all the page source


I'm looking for a method that will allow me to get the title of a webpage and store it as a string.

However, all the solutions I have found so far involve downloading the entire source of the page, which isn't really practical for a large number of webpages.

The only approach I can see is to limit how much is downloaded, either by reading a set number of characters or by stopping once the title tag is reached. Wouldn't that still be quite a lot of data, though?

Thanks


Solution

  • As the <title> tag is in the HTML itself, there is no way to find "just the title" without downloading at least part of the file. You should be able to download a portion of the file until you've read in the <title> tag or the </head> tag and then stop, but you'll still need to download that portion.

    This can be accomplished with HttpWebRequest/HttpWebResponse by reading data from the response stream until we've read in either a <title></title> block or the </head> tag. I added the </head> check because, in valid HTML, the title block must appear within the head block. With this check we stop at the end of the head instead of parsing the entire file (unless there is no head block, of course).

    The following should be able to accomplish this task:

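    // namespaces used by this snippet
    using System;
    using System.IO;
    using System.Net;
    using System.Text;
    using System.Text.RegularExpressions;

    // url is the address of the page to inspect
    string title = "";
    try {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);

        // dispose the response as well as the stream so the connection is released
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (Stream stream = response.GetResponseStream()) {
            // compiled regex to check for a <title></title> block; Singleline lets the
            // title span line breaks and [^>]* tolerates attributes on the tag
            Regex titleCheck = new Regex(@"<title\b[^>]*>\s*(.+?)\s*</title>",
                RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Singleline);
            int bytesToRead = 8192;
            byte[] buffer = new byte[bytesToRead];
            // a Decoder keeps state between reads, so a multi-byte character split
            // across two chunks is still decoded correctly (assumes a UTF-8 page)
            Decoder decoder = Encoding.UTF8.GetDecoder();
            char[] chars = new char[Encoding.UTF8.GetMaxCharCount(bytesToRead)];
            StringBuilder contents = new StringBuilder();
            int length = 0;
            while ((length = stream.Read(buffer, 0, bytesToRead)) > 0) {
                // decode the new bytes and add them to the rest of the
                // contents that have been downloaded so far
                int charCount = decoder.GetChars(buffer, 0, length, chars, 0);
                contents.Append(chars, 0, charCount);
                string soFar = contents.ToString();

                Match m = titleCheck.Match(soFar);
                if (m.Success) {
                    // we found a <title></title> match =]
                    title = m.Groups[1].Value;
                    break;
                } else if (soFar.IndexOf("</head>", StringComparison.OrdinalIgnoreCase) >= 0) {
                    // reached end of head-block (in any letter case); no title found =[
                    break;
                }
            }
        }
    } catch (Exception e) {
        Console.WriteLine(e);
    }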
    

    UPDATE: Updated the original source-example to use a compiled Regex and a using statement for the Stream for better efficiency and maintainability.
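
    The "unless there is no head block" case mentioned above can be bounded with a hard byte limit. The following is only a sketch of that idea, not part of the original example: TitleReader, ReadTitle and the maxBytes default of 65536 are hypothetical names and values chosen for illustration, and the content is assumed to be UTF-8. It takes any Stream, so it can be tried against a MemoryStream without touching the network.

    ```csharp
    using System;
    using System.IO;
    using System.Text;
    using System.Text.RegularExpressions;

    // Hypothetical helper, shown for illustration only.
    public static class TitleReader
    {
        private static readonly Regex TitleCheck = new Regex(
            @"<title\b[^>]*>\s*(.+?)\s*</title>",
            RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // Reads at most maxBytes from the stream. Returns the title text, or null if
        // </head>, the byte limit, or the end of the stream is reached first.
        public static string ReadTitle(Stream stream, int maxBytes = 65536)
        {
            byte[] buffer = new byte[8192];
            Decoder decoder = Encoding.UTF8.GetDecoder(); // assumption: UTF-8 content
            char[] chars = new char[Encoding.UTF8.GetMaxCharCount(buffer.Length)];
            StringBuilder contents = new StringBuilder();
            int total = 0;

            while (total < maxBytes)
            {
                // never ask for more than the remaining budget
                int toRead = Math.Min(buffer.Length, maxBytes - total);
                int length = stream.Read(buffer, 0, toRead);
                if (length <= 0) break; // end of stream
                total += length;

                int charCount = decoder.GetChars(buffer, 0, length, chars, 0);
                contents.Append(chars, 0, charCount);
                string soFar = contents.ToString();

                Match m = TitleCheck.Match(soFar);
                if (m.Success) return m.Groups[1].Value;
                if (soFar.IndexOf("</head>", StringComparison.OrdinalIgnoreCase) >= 0) return null;
            }
            return null; // limit reached or stream ended without a title
        }
    }

    public static class Program
    {
        public static void Main()
        {
            byte[] withTitle = Encoding.UTF8.GetBytes(
                "<html><head><TITLE>\n Example Page \n</TITLE></head><body></body></html>");
            using (MemoryStream ms = new MemoryStream(withTitle))
            {
                Console.WriteLine(TitleReader.ReadTitle(ms)); // prints "Example Page"
            }

            byte[] noTitle = Encoding.UTF8.GetBytes("<html><head></head><body>text</body></html>");
            using (MemoryStream ms = new MemoryStream(noTitle))
            {
                Console.WriteLine(TitleReader.ReadTitle(ms) ?? "(no title)"); // prints "(no title)"
            }
        }
    }
    ```

    With a real page you would pass response.GetResponseStream() instead of the MemoryStream. The limit trades completeness for predictability: a page whose title appears after the first maxBytes bytes is reported as having no title.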