I'm trying to implement the PageRank algorithm on a set of web pages. For that I need a sample dataset of web pages together with the corresponding web graph, i.e. the links between the pages in the dataset.
I need the web graph so I can build the transition matrix and run the calculation. Example:
URL1 -> URL2
URL3390 -> URL5
where URLxxxx is an id that is somehow mapped to the corresponding web page.
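To make the target format concrete, here is a minimal sketch of how an edge list in the "URL1 -> URL2" form above could be turned into a transition matrix and ranked by power iteration. The function names, the damping factor of 0.85, and the choice to treat pages without outbound links as linking to every page are assumptions made for illustration, not something fixed by the question.

```python
def parse_edges(lines):
    """Parse lines such as 'URL1 -> URL2' into (source, target) pairs."""
    edges = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        source, target = (part.strip() for part in line.split("->"))
        edges.append((source, target))
    return edges


def transition_matrix(edges):
    """Return (nodes, P), where P[i][j] is the probability of moving
    from page i to page j by following a random outbound link."""
    nodes = sorted({node for edge in edges for node in edge})
    index = {node: i for i, node in enumerate(nodes)}
    n = len(nodes)
    out_links = [set() for _ in range(n)]
    for source, target in edges:
        out_links[index[source]].add(index[target])

    matrix = []
    for i in range(n):
        if out_links[i]:
            weight = 1.0 / len(out_links[i])
            matrix.append([weight if j in out_links[i] else 0.0 for j in range(n)])
        else:
            # Assumption: a page with no outbound links jumps to any page uniformly.
            matrix.append([1.0 / n] * n)
    return nodes, matrix


def pagerank(matrix, damping=0.85, tol=1e-10, max_iter=200):
    """Power iteration on a row-stochastic transition matrix."""
    n = len(matrix)
    rank = [1.0 / n] * n
    for _ in range(max_iter):
        new_rank = [
            (1.0 - damping) / n
            + damping * sum(rank[i] * matrix[i][j] for i in range(n))
            for j in range(n)
        ]
        # Stop once the ranks change by less than tol in total.
        if sum(abs(a - b) for a, b in zip(new_rank, rank)) < tol:
            return new_rank
        rank = new_rank
    return rank
```

Calling parse_edges on the lines of the dataset, then transition_matrix and pagerank, gives one score per page in the order of the returned node list. Nested lists keep the sketch dependency-free; for more than a few thousand pages a sparse representation would be the more realistic choice.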
My question is: how or where can I get such a resource? (I've tried many links on the internet, but nothing really helped.) I would also like it not to be very large, because of internet connection limitations. If I can't get it in this form, could you give me some advice on what I should do?
Update: for people who may consider this off topic (and they may be right): sites like Software Recommendations or Computer Science don't even have corresponding tags, and this kind of question doesn't really fit there. I appreciate your help.
Maybe Site Visualizer is the tool you're looking for. The app can generate a visual sitemap.
Download and install the app (Standard or Pro version), click the Create new project tool button, type the URL of the website you need to crawl, and then click the Start button.
After the crawl has finished, click the Draw button on the Visual Sitemap tab. The website's graph is drawn as a set of pages (rectangles) and links (lines with arrows). Click a box to select that page and highlight its outbound links.
You can get a dataset of all the website's links from the All Links report (on the Reports tab). The 'From URL' and 'To URL' columns are what you need.
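If that report ends up in a CSV file, the two columns can be mapped to the numeric ids the question asks for. The following is only a rough sketch: it assumes a comma-separated file whose header contains exactly 'From URL' and 'To URL', which may not match what the program actually exports, and the function name is made up for illustration.

```python
import csv
import io


def links_to_id_edges(csv_text, from_col="From URL", to_col="To URL"):
    """Assign each distinct URL an integer id (starting at 1, in order of
    first appearance) and return (url_ids, edges) with edges as id pairs."""
    url_ids = {}
    edges = []
    # Assumption: the report is comma-separated with a header row.
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        pair = []
        for url in (row[from_col], row[to_col]):
            if url not in url_ids:
                url_ids[url] = len(url_ids) + 1
            pair.append(url_ids[url])
        edges.append(tuple(pair))
    return url_ids, edges
```

The url_ids dictionary is the id-to-page mapping in reverse, and each pair in edges corresponds to one "URLx -> URLy" line.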
Besides that, you can extract a dataset of the crawled website's pages or links with your own SQL query. For instance, go to the Database tab, type the following query, and click the Execute tool button:
SELECT * FROM links WHERE link_type='A'
The result set will contain only A-tag links, excluding images, CSS files, JS, etc.
The program has a full-featured 30-day trial period, so you can carry out your tasks for free.