I am building a keyword density analyser. The analyser I have built works absolutely fine with websites that have English content and UTF-8 encoding. When I crawl a website like myegy.com, however, the Arabic keywords show up as question marks on my website. I have tried iconv and mb_convert_encoding, and neither of them works.
I need help creating a keyword density program that can crawl pages in any language and encoding, store the keywords in a database as UTF-8, and display them back correctly.
I am new to encodings, so your help will be really appreciated.
The keywords are displayed on my page as �����, and after iconv as ÈÌæÏÉ. They should be displayed in Arabic, but all I get is question marks.
myegy.com uses the windows-1256 encoding, which iconv supports. The conversion should work as long as you find the page's encoding declaration and call iconv with the correct source encoding.
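As a minimal sketch of that conversion, assuming the source encoding is already known to be windows-1256: the helper name `to_utf8` is hypothetical, and the five sample bytes are the ones that show up as ÈÌæÏÉ when they are read as Latin-1. That output suggests the source encoding passed to iconv may have been ISO-8859-1 rather than windows-1256.

```php
<?php
// Hypothetical helper: convert a byte string from a known source encoding to UTF-8.
function to_utf8(string $bytes, string $sourceEncoding): string
{
    // iconv() returns false if the input is not valid in the source encoding
    // or the encoding name is not recognised by the local iconv library.
    $converted = iconv($sourceEncoding, 'UTF-8', $bytes);
    if ($converted === false) {
        throw new RuntimeException("Cannot convert from $sourceEncoding to UTF-8");
    }
    return $converted;
}

// The same five bytes that render as "ÈÌæÏÉ" when misread as Latin-1.
$raw = "\xC8\xCC\xE6\xCF\xC9";

// Decoded as windows-1256, these bytes are five Arabic letters.
$utf8 = to_utf8($raw, 'WINDOWS-1256');
echo $utf8, "\n";
```

For the result to show up correctly, the page that displays it also has to be served as UTF-8, otherwise the browser will misread the converted bytes again.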
When crawling the web, you'll find a lot of different encodings. Some of them will be incorrectly named, and some will be bogus. Lots of pages lack an encoding declaration altogether and rely on browsers guessing the encoding.
If you want to support all encodings as well as possible, you will need to implement the encoding detection algorithm from the HTML5 specification.
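The following is a heavily simplified sketch of the idea, not the full algorithm from the specification. It checks for a byte order mark, then the charset parameter of the HTTP Content-Type header, then a meta declaration near the start of the document, and finally falls back to a default. The function name `detect_encoding`, the 1024-byte scan window, and the windows-1252 fallback are assumptions made for illustration. The sketch does not normalise encoding labels or guess from byte patterns.

```php
<?php
// Hypothetical, simplified encoding detection for a fetched HTML document.
// $contentType is the raw value of the HTTP Content-Type header, if any.
function detect_encoding(string $html, ?string $contentType = null): string
{
    // 1. A byte order mark takes precedence over every declaration.
    if (strncmp($html, "\xEF\xBB\xBF", 3) === 0) {
        return 'UTF-8';
    }
    if (strncmp($html, "\xFE\xFF", 2) === 0) {
        return 'UTF-16BE';
    }
    if (strncmp($html, "\xFF\xFE", 2) === 0) {
        return 'UTF-16LE';
    }

    // 2. charset parameter sent by the server, e.g. "text/html; charset=windows-1256".
    if ($contentType !== null
        && preg_match('/charset\s*=\s*["\']?([\w.:-]+)/i', $contentType, $m)) {
        return strtoupper($m[1]);
    }

    // 3. <meta charset="..."> or <meta http-equiv=... content="...charset=...">
    //    within the first 1024 bytes of the document.
    $head = substr($html, 0, 1024);
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)/i', $head, $m)) {
        return strtoupper($m[1]);
    }

    // 4. Nothing declared: fall back to a default (an assumption of this sketch).
    return 'WINDOWS-1252';
}

echo detect_encoding('<html><head><meta charset="windows-1256"></head></html>'), "\n";
```

The returned label can then be passed to iconv as the source encoding. Because declared names are sometimes wrong or bogus, a crawler would still need to handle a failed conversion.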
Also note that PHP's built-in DOMDocument::loadHTML() supports very few encodings. You'll have to convert documents (and the encoding declarations in them) to UTF-8 first.
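One way this could look, as a sketch rather than a definitive implementation: convert the bytes with iconv, strip the old meta charset declaration with a rough regular expression, and prepend a UTF-8 declaration before handing the markup to DOMDocument. The function name `load_as_utf8` is hypothetical, and the source encoding is assumed to have been detected beforehand.

```php
<?php
// Hypothetical helper: convert a page to UTF-8 and parse it with DOMDocument.
function load_as_utf8(string $html, string $sourceEncoding): DOMDocument
{
    $utf8 = iconv($sourceEncoding, 'UTF-8', $html);
    if ($utf8 === false) {
        throw new RuntimeException("Cannot convert from $sourceEncoding to UTF-8");
    }

    // Remove existing <meta ... charset=...> declarations, which would now be wrong.
    // This is a rough pattern, not a real HTML parser.
    $utf8 = preg_replace('/<meta[^>]+charset\s*=[^>]*>/i', '', $utf8);

    // Prepend a UTF-8 declaration so that the parser does not fall back
    // to its default encoding.
    $declaration = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings on messy markup
    $doc->loadHTML($declaration . $utf8);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}

$page = "<html><head><meta charset=\"windows-1256\">"
      . "<title>\xC8\xCC\xE6\xCF\xC9</title></head><body></body></html>";
$doc = load_as_utf8($page, 'WINDOWS-1256');
echo $doc->getElementsByTagName('title')->item(0)->textContent, "\n";
```

Text read back from the resulting document is UTF-8, so it can be stored in the database and displayed without a further conversion.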