windows wget wikipedia

Download an article with a Unicode title from Wikipedia in XML format using wget


I am currently downloading the XML of individual articles from Wikipedia. For this I use wget with the following call format:

https://de.wiktionary.org/wiki/Special:Export/?title=Special:Export&pages=<page>&curonly=1&templates=1&action=submit

This works in general, but I have problems with titles that contain, for example, Cyrillic characters. They are percent-encoded in the page name (a lot of %), and that does not seem to work: I always get back only the schema definition. If I enter the address above in the browser, it works. I have already tried --remote-encoding=UTF-8 without success. The problem occurs on Windows.


Solution

  • It is not sufficient to set only the encoding of the target server with

     --remote-encoding=UTF8
    

    The encoding of the input must also be specified:

     --local-encoding=UTF8
    

    With both options set, wget no longer applies the % replacement to the title. Otherwise wget assumes ASCII encoding and uses the % replacement.
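As an illustration of the answer above, here is a minimal sketch of how the call could be assembled from a script. It assumes wget is on the PATH; the helper names `build_wget_command` and `download`, the sample title and the output file name are hypothetical and not part of the original post. The URL pattern is the one quoted in the question, and the title is passed unencoded, as the answer suggests.

```python
import subprocess

# URL pattern quoted in the question; {page} stands for the article title.
EXPORT_URL = (
    "https://de.wiktionary.org/wiki/Special:Export/"
    "?title=Special:Export&pages={page}&curonly=1&templates=1&action=submit"
)


def build_wget_command(page, outfile):
    """Return the wget argument list with both encoding options set.

    The title is inserted as-is (not percent-encoded), following the answer.
    """
    return [
        "wget",
        "--local-encoding=UTF8",   # encoding of the input (the command line)
        "--remote-encoding=UTF8",  # encoding assumed for the target server
        "-O", outfile,             # write the exported XML to this file
        EXPORT_URL.format(page=page),
    ]


def download(page, outfile):
    """Run wget; raises CalledProcessError if wget exits with an error."""
    subprocess.run(build_wget_command(page, outfile), check=True)


if __name__ == "__main__":
    # Hypothetical example title with Cyrillic characters.
    print(build_wget_command("слово", "export.xml"))
```

Passing the arguments as a list avoids a second layer of shell quoting. Whether the Unicode title reaches wget intact on Windows may still depend on the console code page, so this should be treated as a starting point rather than a verified recipe.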