
"EOFError: Ran out of input" when using Wikipedia Extractor to parse a Wikipedia dump file


I'm trying to convert a bz2 dump to text with Wikipedia Extractor (https://github.com/attardi/wikiextractor). I downloaded a Wikipedia dump with the .bz2 extension and then ran this command:

python Wikiextractor.py -b 85M -o extracted D:\wikiextractor-master\wikiextractor\zhwiki-latest-pages-articles.xml.bz2

After it finished preprocessing the pages, it failed with "EOFError: Ran out of input" (the error was posted as a screenshot).

How can I fix this?


Solution

  • I ran into this problem too. It is likely caused by the StringIO issue on Windows. I re-ran the same command on Windows Subsystem for Linux (WSL) and it worked.
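
If you go the WSL route, the dump path from the question has to be rewritten, because WSL mounts Windows drives under /mnt by default (D:\ becomes /mnt/d). Here is a small sketch of that translation. The helper name to_wsl_path is made up for illustration, and it assumes the default /mnt mount point. The script name WikiExtractor.py is also an assumption: file names are case-sensitive on Linux, so use whatever the file is actually called in your checkout.

```python
from pathlib import PurePosixPath, PureWindowsPath


def to_wsl_path(windows_path: str) -> str:
    """Translate an absolute Windows path into its default WSL mount path.

    Hypothetical helper: assumes WSL's default automount root (/mnt).
    """
    p = PureWindowsPath(windows_path)
    drive = p.drive.rstrip(":").lower()
    # Only plain drive-letter paths such as D:\... are handled here.
    if len(drive) != 1 or not p.root:
        raise ValueError(f"not an absolute drive-letter path: {windows_path!r}")
    # parts[0] is the drive anchor ("D:\\"); the rest are path components.
    return str(PurePosixPath("/mnt", drive, *p.parts[1:]))


dump = to_wsl_path(
    r"D:\wikiextractor-master\wikiextractor\zhwiki-latest-pages-articles.xml.bz2"
)

# Same options as in the question (-b 85M -o extracted); the script name
# is an assumption, adjust it to match the file in your checkout.
command = ["python3", "WikiExtractor.py", "-b", "85M", "-o", "extracted", dump]
print(" ".join(command))
```

Run the printed command from a WSL shell, in the directory that contains the extractor script.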