Garbage value while reading HTML body using C#


I have an HTML file with content like this:

<HTML>
<BODY>
...
........ company's Chief Financial Officer.   Now the.......
...
</BODY>
</HTML>

I am reading the content of this file using:

StringBuilder stringBuilder = new StringBuilder();
using (StreamReader sr = new StreamReader(filePath))
{
   String line = sr.ReadToEnd();
   stringBuilder.Append(line);
}
strFileContent = stringBuilder.ToString();

However, it returns the string as:

........ company�s Chief Financial Officer.���Now the.......

The HTML files are on my local system.


Solution

  • You need to read the file with the same encoding that was used to create it. By default, StreamReader assumes UTF-8 and tries to decode the file that way, but your file is actually windows-1252 (as you said in the comments). Decoding with the wrong encoding produces junk: a byte such as 0x92, which is the curly apostrophe in windows-1252, is not a valid UTF-8 sequence on its own, so it comes out as the replacement character �.

    You should state explicitly which encoding the file is in. Here's how you do it:

    var encoding = Encoding.GetEncoding(1252); // windows-1252
    using (StreamReader sr = new StreamReader(filePath, encoding))
    ...
    

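To see the effect in isolation, here is a minimal sketch that writes a few windows-1252 bytes to a temporary file and reads them back twice, once as UTF-8 and once as windows-1252. The names EncodingDemo and ReadAllText are made up for illustration, and the temporary file stands in for your HTML file. It also assumes a modern .NET runtime (.NET Core 3.0 or later), where code page 1252 is only available after registering CodePagesEncodingProvider; on .NET Framework that call should not be needed.

```csharp
using System;
using System.IO;
using System.Text;

public static class EncodingDemo
{
    // Hypothetical helper: reads a whole file with an explicitly chosen encoding.
    public static string ReadAllText(string filePath, Encoding encoding)
    {
        using (var sr = new StreamReader(filePath, encoding))
        {
            return sr.ReadToEnd();
        }
    }

    public static void Main()
    {
        // Assumption: modern .NET, where code page 1252 must be registered first.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding windows1252 = Encoding.GetEncoding(1252);

        // "company’s" as windows-1252 bytes: the curly apostrophe is the single byte 0x92.
        byte[] bytes = windows1252.GetBytes("company\u2019s");
        string filePath = Path.GetTempFileName();
        try
        {
            File.WriteAllBytes(filePath, bytes);

            string wrong = ReadAllText(filePath, Encoding.UTF8); // wrong encoding
            string right = ReadAllText(filePath, windows1252);   // matching encoding

            // 0x92 is not valid UTF-8, so the decoder substitutes U+FFFD.
            Console.WriteLine(wrong.IndexOf('\uFFFD') >= 0);     // True
            Console.WriteLine(right == "company\u2019s");        // True
        }
        finally
        {
            File.Delete(filePath);
        }
    }
}
```

If you are not sure which encoding a file uses, looking at the raw byte values around a garbled character (0x92 here) is usually a quick way to narrow it down.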