What?:
I'm trying to get a page-to-page link map (matrix) of Wikipedia pages, keyed by page_id, in the following format:
from1 to1 to2 to3 ...
from2 to1 to2 to3 ...
...
Why?:
I'm looking for a data set (pages from Wikipedia) to try out PageRank.
Problem:
At dumps.wikimedia.org it is possible to download pages-articles.xml, which is an XML file in this format:
<page>
<title>...</title>
<id>...</id> // pageid
<text>...</text>
</page>
I will use that file for retrieving the article text. There is also the base per-page data (page.sql), which contains some details about pages by page_id, and the last file that seems relevant to me is pagelinks.sql, which contains page-to-page link records. The problem is that the pagelinks table has the following fields: pl_from, pl_namespace and pl_title, so the target of each link is identified by namespace and title rather than by page_id.
Idea: Create a temporary database, import the page and pagelinks tables, and build this matrix from the pagelinks table by looking up the page_id for each pl_title.
Possible solution:
SELECT pl_from, GROUP_CONCAT(page_id SEPARATOR ' ') FROM pagelinks
JOIN page ON
pl_title = page_title AND pl_namespace = page_namespace
GROUP BY pl_from
or, to get a map of "backlinks" (to1 from1 from2 from3 ..., not from1 to1 to2 to3 ...):
SELECT page_id, GROUP_CONCAT(pl_from SEPARATOR ' ') FROM pagelinks
JOIN page ON
pl_title = page_title AND pl_namespace = page_namespace
GROUP BY page_id
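For comparison, the same join can be sketched outside the database. This is only a minimal illustration, assuming the rows of both tables are already available as tuples; the function names build_link_maps and format_map and the sample data are hypothetical and not part of any Wikimedia tooling.

```python
from collections import defaultdict


def build_link_maps(pages, pagelinks):
    """Resolve link targets to page ids.

    pages:     iterable of (page_id, page_namespace, page_title)
    pagelinks: iterable of (pl_from, pl_namespace, pl_title)
    """
    # Index pages by (namespace, title), the same key the SQL JOIN uses.
    id_by_key = {(ns, title): page_id for page_id, ns, title in pages}

    forward = defaultdict(list)   # from_id -> [to_id, ...]
    backward = defaultdict(list)  # to_id   -> [from_id, ...]
    for pl_from, pl_namespace, pl_title in pagelinks:
        to_id = id_by_key.get((pl_namespace, pl_title))
        if to_id is None:
            # Link to a page that does not exist; the inner JOIN drops it too.
            continue
        forward[pl_from].append(to_id)
        backward[to_id].append(pl_from)
    return dict(forward), dict(backward)


def format_map(link_map):
    """Render each entry as 'key id1 id2 ...', sorted by key."""
    return [" ".join(map(str, [key] + ids)) for key, ids in sorted(link_map.items())]


# Made-up sample rows, for illustration only.
pages = [(1, 0, "A"), (2, 0, "B"), (3, 0, "C")]
pagelinks = [(1, 0, "B"), (1, 0, "C"), (2, 0, "C"), (3, 0, "Missing")]

forward, backward = build_link_maps(pages, pagelinks)
print("\n".join(format_map(forward)))   # from1 to1 to2 ...
print("\n".join(format_map(backward)))  # to1 from1 from2 ...
```

This keeps the whole title index in memory, so treat it as a sketch rather than something sized for the full dump. One thing to keep in mind on the SQL side: MySQL truncates GROUP_CONCAT results at group_concat_max_len (1024 by default), so that setting probably needs to be raised for pages with many links.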
Question:
Is there a place where I can get this kind of page-to-page link matrix by page_id, so that I don't need to create it on my own?
Or, if not, is there a faster way to build this matrix than the solution I've outlined?
I think importing both tables into a database and using that is probably the easiest way to do this (and you don't need pages-articles.xml for that).
Another option would be to manually parse the two SQL files without importing them into a database. I actually created a library to do exactly that in .NET.
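To give a rough idea of what parsing the files directly could look like, here is a minimal Python sketch. It is not the .NET library mentioned above. It assumes the dump consists of mysqldump-style lines of the form INSERT INTO `table` VALUES (...),(...); and that each row starts with two integers followed by a quoted string, as in the pl_from, pl_namespace, pl_title layout described in the question. The names ROW_RE, unescape and iter_rows are hypothetical.

```python
import re

# Matches the start of one row tuple: two integers, then a quoted string,
# e.g. (12,0,'Some_title'. Backslash-escaped characters inside the string
# are accepted. Any further columns in the row are ignored.
ROW_RE = re.compile(r"\((\d+),(-?\d+),'((?:[^'\\]|\\.)*)'")


def unescape(value):
    # Simplified unescaping: drops the backslash in front of any character.
    # Enough for \' and \\ ; control escapes such as \n are not handled.
    return re.sub(r"\\(.)", r"\1", value)


def iter_rows(lines, table):
    """Yield (int, int, str) for every row in the INSERT lines of `table`."""
    prefix = "INSERT INTO `%s` VALUES " % table
    for line in lines:
        if not line.startswith(prefix):
            continue  # comments, CREATE TABLE, other tables
        for match in ROW_RE.finditer(line, len(prefix)):
            yield int(match.group(1)), int(match.group(2)), unescape(match.group(3))


# Made-up sample in the assumed dump format; a real file would be read
# line by line with open(...) instead.
sample = [
    "-- some comment\n",
    "INSERT INTO `pagelinks` VALUES (1,0,'B'),(1,0,'It\\'s_(x)'),(2,14,'C');\n",
]

for row in iter_rows(sample, "pagelinks"):
    print(row)
```

Scanning with a regular expression is a shortcut: it could in principle be fooled by a later string column that happens to contain text shaped like the start of a row, so a proper tokenizer would be safer for tables with more columns.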