I made a simple web scraper that scrapes lyrics for me and writes them to a database. Everything works, but for some reason it's replacing some characters with question marks, and when I view the data on a simple PHP web page I see a lot of mistakes in the lyrics.
I?m = I'm
Let?s = Let's
haven?t = haven't
Stuff like that.
I know the error is in my C# code, because I put a breakpoint before it writes to the database and displayed the text in a rich text box, and the characters are already wrong there. How would I get it to display these characters correctly?
public static string getSourceCode(string url)
{
    HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
    HttpWebResponse resp = (HttpWebResponse)req.GetResponse();
    StreamReader sr = new StreamReader(resp.GetResponseStream());
    string sourceCode = sr.ReadToEnd();
    sr.Close();
    resp.Close();
    return sourceCode;
}
........
string url = txbURL2.Text;
string sourceCode = WorkerClass.getSourceCode(url);
int startIndex = sourceCode.IndexOf("<td valign=\"top\" width=\"100%\">");
sourceCode = sourceCode.Substring(startIndex, sourceCode.Length - startIndex);
........
//Gets Lyric
startIndex = sourceCode.IndexOf("<br><b>Lyrics:</b><br><br>") + 30;
endIndex = sourceCode.IndexOf(" <br><br>", startIndex);
string lyric = sourceCode.Substring(startIndex, endIndex - startIndex) + "";
rtbLyric.Text = lyric;
//End Lyric
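The problem is probably character encoding. My guess is that the web page you're scraping is encoded in UTF-8, but somewhere along the line the text is being converted to ASCII, which replaces any character it cannot represent (such as a curly apostrophe) with a question mark.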
Check out the excellent article called "What every developer should know about character encoding" for more details.
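To illustrate how a question mark can appear, here is a minimal sketch of text passing through ASCII. The class and method names (EncodingDemo, RoundTripThroughAscii) are made up for this example and are not part of your code; the assumption is that the lyrics contain a curly apostrophe (U+2019) rather than a plain one.

```csharp
using System;
using System.Text;

public static class EncodingDemo
{
    // Hypothetical helper: encode the text as ASCII bytes and decode it again.
    // Characters outside the ASCII range are replaced with '?' during encoding.
    public static string RoundTripThroughAscii(string text)
    {
        byte[] asciiBytes = Encoding.ASCII.GetBytes(text);
        return Encoding.ASCII.GetString(asciiBytes);
    }

    public static void Main()
    {
        string original = "I\u2019m"; // U+2019 is the right single quotation mark
        Console.WriteLine(RoundTripThroughAscii(original)); // prints "I?m"
    }
}
```

If your output looks like this, the thing to look for is the step where the text is read or written with an encoding that cannot represent the original characters.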
Update
You could try this, although the StreamReader should default to UTF-8 anyway:
var encoding = System.Text.Encoding.GetEncoding("utf-8");
StreamReader sr = new StreamReader(resp.GetResponseStream(), encoding);
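If hard-coding UTF-8 does not help, the page may be served in a different encoding. As a rough sketch, and only an assumption about your situation, you could use the charset the server reports via HttpWebResponse.CharacterSet and fall back to UTF-8 when it is missing or unrecognized. ResolveEncoding is a hypothetical helper name; getSourceCode and WorkerClass are taken from your question.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

public static class WorkerClass
{
    // Hypothetical helper: map a charset name to an Encoding,
    // falling back to UTF-8 if the name is empty or not recognized.
    public static Encoding ResolveEncoding(string charset)
    {
        if (string.IsNullOrWhiteSpace(charset))
            return Encoding.UTF8;
        try
        {
            // Some servers wrap the charset value in quotes, so strip them.
            return Encoding.GetEncoding(charset.Trim().Trim('"'));
        }
        catch (ArgumentException)
        {
            return Encoding.UTF8;
        }
    }

    public static string getSourceCode(string url)
    {
        HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
        // The using blocks close the response and reader even if reading throws.
        using (HttpWebResponse resp = (HttpWebResponse)req.GetResponse())
        using (StreamReader sr = new StreamReader(
            resp.GetResponseStream(), ResolveEncoding(resp.CharacterSet)))
        {
            return sr.ReadToEnd();
        }
    }
}
```

This only helps when the server's Content-Type header declares the right charset; if the header is wrong, you would have to look at the page's own meta tag or choose the encoding by hand.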