
How do I use the information in the index file of a Wikipedia dump download?


I am trying to do some research about Chinese people using Wikipedia data. As an alternative to DBpedia (whose information about Chinese people is somewhat limited compared to zh.wikipedia.org), I found that I can download a dump of zhwiki directly from http://download.wikipedia.com/zhwiki/20150301/.

I see there is an index file, and in it I can see rows such as: 966576:291:人物

I assume this is a lookup key. Can someone tell me how to use it to look up a page in the main file or database?


Solution

  • There are two files:

    • zhwiki-20150301-pages-articles-multistream.xml.bz2 (1.1 GB): the dump itself, made up of multiple concatenated bz2 streams with 100 pages per stream
    • zhwiki-20150301-pages-articles-multistream-index.txt.bz2 (18.8 MB): the index file

    The index file has lines of the form:

    • offset1:pageId1:title1
    • offset1:pageId2:title2
    • ..
    • offset2:pageId101:title101 and so on.

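    Each offset is the byte position in the multistream file at which a bz2 stream starts, so all pages in the same stream share the same offset. For your example row, 966576:291:人物 means that the page with ID 291, titled 人物, is in the stream that starts at byte 966576. To get a page, read the bytes from offset1 up to (but not including) offset2 from the .xml.bz2 file and pass them to a bz2 decoder. The output is the XML for the 100 pages in that stream, which you can then search by page ID or title.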
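
    The lookup described above could be sketched in Python roughly as follows. This is a minimal sketch that uses only the standard `bz2` module; the function names (`parse_index_line`, `find_stream_range`, `read_stream`) are made up for illustration, and it assumes the index is UTF-8 text with one `offset:pageId:title` entry per line, sorted by offset.

```python
import bz2


def parse_index_line(line):
    """Split 'offset:pageId:title' into (int, int, str).

    Only the first two colons are separators, because titles may
    themselves contain ':'.
    """
    offset, page_id, title = line.rstrip("\n").split(":", 2)
    return int(offset), int(page_id), title


def find_stream_range(index_path, wanted_title):
    """Return (start, end) byte offsets of the stream holding wanted_title.

    end is None when the title is in the last stream listed in the index.
    This scans the index linearly; for many lookups, load it into a dict.
    """
    start = None
    with bz2.open(index_path, "rt", encoding="utf-8") as index:
        for line in index:
            offset, _page_id, title = parse_index_line(line)
            if start is None:
                if title == wanted_title:
                    start = offset
            elif offset > start:
                # First entry of the next stream: this is where ours ends.
                return start, offset
    if start is None:
        raise KeyError(wanted_title)
    return start, None


def read_stream(dump_path, start, end=None):
    """Decompress the single bz2 stream that begins at byte `start`."""
    with open(dump_path, "rb") as dump:
        dump.seek(start)
        data = dump.read() if end is None else dump.read(end - start)
    # BZ2Decompressor stops at the end of the first stream, so any
    # trailing bytes after it are ignored.
    return bz2.BZ2Decompressor().decompress(data).decode("utf-8")
```

    With the two files from above, a lookup might then look like `start, end = find_stream_range("zhwiki-20150301-pages-articles-multistream-index.txt.bz2", "人物")` followed by `xml = read_stream("zhwiki-20150301-pages-articles-multistream.xml.bz2", start, end)`. The result is an XML fragment containing the `<page>` elements of that stream, not a complete document, so it needs to be wrapped in a root element or scanned page by page before parsing.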