php urlencode htmlspecialchars

Getting rid of &#039; in output of file_get_contents


I'm working on a tool for Wikipedia. I retrieve the page https://de.wikipedia.org/wiki/Spezial:Linkliste/Hans_Jansen_(Arabist) with file_get_contents(), then extract all list items by locating the list and splitting it at \n with explode().

Afterwards I want to retrieve the article texts named after the list items. For that I do

 file_get_contents("https://de.wikipedia.org/w/index.php?action=raw&title=".urlencode($article));

Everything works until the article named Ka'b ibn As'ad, which leads to retrieval of

https://de.wikipedia.org/w/index.php?action=raw&title=Ka

When I copy the article name as plain text, everything goes well:

 $article = "Ka'b ibn As'ad";
 $page = "https://".$server."/w/index.php?action=raw&title=".urlencode($article);

Comparing the output of urlencode() for $article typed manually and for $article retrieved from the website shows the difference:

  manually: Ka%27b+ibn+As%27ad
  website:  Ka%26%23039%3Bb%20ibn%20As%26%23039%3Bad

Comparing the output of htmlspecialchars() is even more telling:

  manually: Ka'b ibn As'ad
  website:  Ka&#039;b ibn As&#039;ad

How do I get rid of those &#039; special characters? Apparently htmlspecialchars_decode() does not work.


Solution

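  • htmlspecialchars_decode() only converts a small fixed set of entities (&amp;, &quot;, &lt;, &gt;, and &#039; only when ENT_QUOTES is set, which was not the default before PHP 8.1). To decode numeric entities in general, use html_entity_decode(), passing ENT_QUOTES so that &#039; is included.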
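
A minimal sketch of how this could be applied to the question's code, assuming the scraped title still contains HTML entities such as &#039;. The function name rawArticleUrl() and its parameters are made up for illustration; only html_entity_decode() and urlencode() are real PHP functions. The idea is to decode the entities first and URL-encode afterwards:

```php
<?php
// Hypothetical helper (not from the original post): builds the action=raw URL
// for an article title that was scraped out of HTML source.
function rawArticleUrl(string $server, string $htmlTitle): string
{
    // Turn entities such as &#039; or &amp; back into literal characters.
    // ENT_QUOTES is passed explicitly because the default flags differ
    // between PHP versions (they changed in PHP 8.1).
    $title = html_entity_decode($htmlTitle, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Only now URL-encode the plain-text title.
    return 'https://' . $server . '/w/index.php?action=raw&title=' . urlencode($title);
}

// Title as it appears in the page source of the link list.
$scraped = 'Ka&#039;b ibn As&#039;ad';
echo rawArticleUrl('de.wikipedia.org', $scraped), "\n";
```

With the entity decoded first, the encoded title should come out the same as for the manually typed string in the question (Ka%27b+ibn+As%27ad).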