sql wikipedia pagerank

Wikipedia page-to-page links by pageid


What?:
I'm trying to get a page-to-page link map (matrix) of Wikipedia pages by page_id in the following format:

from1 to1 to2 to3 ...
from2 to1 to2 to3 ...
...

Why?:
I'm looking for a data set (pages from Wikipedia) to try out PageRank.

Problem:
At dumps.wikimedia.org it is possible to download pages-articles.xml, which is an XML file in this format:

<page>
  <title>...</title>
  <id>...</id>          <!-- page_id -->
  <text>...</text>
</page>

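I will use this file to retrieve the articles (text). There is also the base per-page data (page.sql), which contains some details about each page keyed by page_id. The last file that seems relevant to me is pagelinks.sql, which contains the page-to-page link records. The problem is that the pagelinks table has the fields pl_from, pl_namespace and pl_title: the source of a link is given as a page_id (pl_from), but the target is given only as a namespace and a title, not as a page_id.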
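For the article text, a file in the format shown above can be read as a stream rather than loaded whole. The following is only a minimal sketch: the function names `local_name` and `iter_pages` and the sample data are made up for illustration, and it assumes that the first `<id>` inside each `<page>` is the page_id.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (page_id, title, text) for each <page> element in the stream."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        fields = {}
        for node in elem.iter():
            name = local_name(node.tag)
            # Keep only the first occurrence of each tag: in real dumps the
            # <revision> element has its own <id>, after the page's <id>.
            if name in ("title", "id", "text") and name not in fields:
                fields[name] = node.text or ""
        yield int(fields["id"]), fields["title"], fields.get("text", "")
        elem.clear()  # drop the children of the page that was just processed


# Tiny made-up sample in the simplified format from the question.
sample = b"""<mediawiki>
  <page><title>Foo</title><id>1</id><text>links to [[Bar]]</text></page>
  <page><title>Bar</title><id>2</id><text>no links</text></page>
</mediawiki>"""
pages = list(iter_pages(io.BytesIO(sample)))
```

With a real dump, the same function could be given an open (and decompressed) file object instead of the `io.BytesIO` sample.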

Idea: Create a temporary database, import the page and pagelinks tables, and build the matrix from pagelinks by looking up the page_id for each (pl_namespace, pl_title) pair. Possible solution:

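-- MySQL truncates GROUP_CONCAT results at group_concat_max_len (1024 by default),
-- so raise the limit first or pages with many links will lose targets.
SET SESSION group_concat_max_len = 1000000;
SELECT pl_from, GROUP_CONCAT(page_id SEPARATOR ' ') FROM pagelinks
    JOIN page ON
        pl_title = page_title AND pl_namespace = page_namespace
GROUP BY pl_from;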

Or, to get a map of "backlinks" (to1 from1 from2 from3 ..., rather than from1 to1 to2 to3 ...):

SELECT page_id, GROUP_CONCAT(pl_from SEPARATOR ' ') FROM pagelinks
    JOIN page ON
        pl_title = page_title AND pl_namespace = page_namespace
GROUP BY page_id;
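For comparison, the same join can be sketched in memory once the rows of both tables are available as tuples. This is an illustration only: `build_link_map`, `invert` and the sample rows are hypothetical, and the column order (page_id, page_namespace, page_title) and (pl_from, pl_namespace, pl_title) is assumed.

```python
from collections import defaultdict


def build_link_map(pages, pagelinks):
    """Resolve link targets to page_ids.

    pages:     iterable of (page_id, page_namespace, page_title)
    pagelinks: iterable of (pl_from, pl_namespace, pl_title)
    Returns {pl_from: sorted list of target page_ids}.
    """
    id_by_key = {(ns, title): pid for pid, ns, title in pages}
    out = defaultdict(set)
    for pl_from, ns, title in pagelinks:
        target = id_by_key.get((ns, title))
        if target is not None:  # like the inner JOIN: skip links to missing pages
            out[pl_from].add(target)
    return {src: sorted(dst) for src, dst in out.items()}


def invert(link_map):
    """Turn {from: [to, ...]} into the backlink map {to: [from, ...]}."""
    back = defaultdict(list)
    for src, targets in link_map.items():
        for dst in targets:
            back[dst].append(src)
    return {dst: sorted(srcs) for dst, srcs in back.items()}


# Made-up sample rows.
sample_pages = [(1, 0, "Foo"), (2, 0, "Bar"), (3, 0, "Baz")]
sample_links = [(1, 0, "Bar"), (1, 0, "Baz"), (2, 0, "Foo"), (3, 0, "Missing")]

links = build_link_map(sample_pages, sample_links)
backlinks = invert(links)

# Print in the "from1 to1 to2 to3 ..." format from the question.
for src, targets in sorted(links.items()):
    print(src, *targets)
```

The title-to-id dictionary plays the role of the index on (page_namespace, page_title); whether this beats the database depends on whether the whole page table fits in memory.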

Question:
Is there a place where I can get this kind of page-to-page link matrix by page_id, so that I don't need to build it myself? If not, is there a faster way to build it than the solution outlined above?


Solution

  • I think importing both tables into a database and querying them there is probably the easiest way to do this (and you don't need pages-articles.xml for that).

    Another option would be to parse the two SQL files directly, without importing them into a database. I actually created a library to do exactly that in .NET.
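To give an idea of what parsing the SQL files directly involves, here is a rough Python sketch (it is not the .NET library mentioned above). It assumes the dump uses mysqldump-style `INSERT INTO ... VALUES (...),(...);` lines with backslash-escaped strings; `parse_insert_values`, `convert` and the sample line are hypothetical.

```python
# Backslash escapes used inside quoted strings; any other escaped character
# (such as \' or \\) stands for itself.
ESCAPES = {"n": "\n", "r": "\r", "t": "\t", "0": "\0", "Z": "\x1a"}


def convert(raw, quoted):
    """Turn one raw field into a str, None, int or float."""
    if quoted:
        return raw
    if raw == "NULL":
        return None
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw


def parse_insert_values(line):
    """Yield one tuple per (...) group of an INSERT line; skip other lines."""
    pos = line.find(" VALUES ")
    if not line.startswith("INSERT INTO") or pos < 0:
        return
    row = None         # fields of the tuple being read, None between tuples
    buf = []           # characters of the current field
    quoted = False     # True if the current field was a quoted string
    in_string = False  # True while inside '...'
    chars = iter(line[pos + len(" VALUES "):])
    for ch in chars:
        if in_string:
            if ch == "\\":
                nxt = next(chars, "")
                buf.append(ESCAPES.get(nxt, nxt))
            elif ch == "'":
                in_string = False
            else:
                buf.append(ch)
        elif ch == "'":
            in_string, quoted = True, True
        elif ch == "(" and row is None:
            row = []
        elif ch in ",)" and row is not None:
            row.append(convert("".join(buf), quoted))
            buf, quoted = [], False
            if ch == ")":
                yield tuple(row)
                row = None
        elif row is not None:
            buf.append(ch)


# Made-up sample line; real dump lines are much longer.
sample = r"INSERT INTO `pagelinks` VALUES (10,0,'Foo_bar'),(12,0,'It\'s_(a,b)'),(7,NULL,'x');"
rows = list(parse_insert_values(sample))
```

For a compressed dump, the lines could be read with `gzip.open(path, "rt", encoding="utf-8", errors="replace")` and passed to `parse_insert_values` one at a time. The column order should be checked against the CREATE TABLE statement at the top of each file before picking fields out of the tuples.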