I'm working on a tool for Wikipedia. I'm trying to retrieve the page https://de.wikipedia.org/wiki/Spezial:Linkliste/Hans_Jansen_(Arabist) with file_get_contents. Then I extract all list items by locating the list and exploding it at \n.
Afterwards I want to retrieve the article texts named after the list items. For that I do
file_get_contents("https://de.wikipedia.org/w/index.php?action=raw&title=".urlencode($article));
Everything goes well until the article named Ka'b ibn As'ad which leads to retrieval of
https://de.wikipedia.org/w/index.php?action=raw&title=Ka
When I copy the article name as plain text, everything goes well:
$article = "Ka'b ibn As'ad";
$page = "https://".$server."/w/index.php?action=raw&title=".urlencode($article);
Comparing the output of urlencode for $article typed manually and retrieved from the website shows the difference:
manually: Ka%27b+ibn+As%27ad
website: Ka%26%23039%3Bb%20ibn%20As%26%23039%3Bad
Comparing the output with htmlspecialchars() is even more impressive:
manually: Ka'b ibn As'ad
website: Ka&#039;b ibn As&#039;ad
How do I get rid of those &#039; special characters? Apparently htmlspecialchars_decode() does not work.
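htmlspecialchars_decode() only converts a handful of entities (&amp;, &quot;, &lt;, &gt; and &#039;), and it decodes &#039; only when the ENT_QUOTES flag is set. Before PHP 8.1 the default flag was ENT_COMPAT, which leaves single quotes alone. Use html_entity_decode() instead and pass ENT_QUOTES explicitly, e.g. html_entity_decode($article, ENT_QUOTES), so that &#039; is turned back into ' before you call urlencode().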
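A minimal sketch of how that could look in your case, assuming the list item still contains the entity exactly as it appears in the page HTML. The function name buildRawUrl is made up for this example and is not part of your code; the server name and URL pattern are taken from the question.

```php
<?php
// Hypothetical helper (name invented for illustration): turns a list item
// scraped from the HTML page into the action=raw URL for that article.
function buildRawUrl(string $server, string $listItem): string
{
    // Decode named and numeric entities. ENT_QUOTES is passed explicitly so
    // that &#039; is decoded on PHP versions older than 8.1 as well.
    $title = html_entity_decode($listItem, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Only now URL-encode the plain-text title.
    return "https://" . $server . "/w/index.php?action=raw&title=" . urlencode($title);
}

// The entity-encoded title as it comes out of the scraped HTML.
echo buildRawUrl("de.wikipedia.org", "Ka&#039;b ibn As&#039;ad"), "\n";
// prints https://de.wikipedia.org/w/index.php?action=raw&title=Ka%27b+ibn+As%27ad
```

The title parameter then matches the Ka%27b+ibn+As%27ad you got when typing the name manually.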