I'm trying to graph the linking structure of a web site so I can model how pages on a given domain link to each other. Note I'm not graphing links to sites not on the root domain.
Obviously this graph could be considerable in size. One of the main queries I want to perform is to count how many pages directly link into a given url. I want to run this against the whole graph (shudder) such that I end up with a list of urls and the count of incoming links to that url.
I know one popular way of doing this would be via some kind of map reduce - and I may still end up going that way - however I have a requirement to be able to view this report in (near) realtime which isn't generally map reduce friendly.
I've had a quick look at neo4j and OrientDb. While both of these could model the relationship I want it's not clear if I could query them to generate the report I want. At this point I'm not committed to any particularly technology.
Any help would be greatly appreciated. Thanks, Paul
both OrientDB and Neo4J supports Blueprints as common API to make graph operations like traversal, counting, etc.
If I've understood well your use case your graph seems pretty simple: you have a "URL" Vertex that links each other with one type of Edge "Links".
To execute operation against graphs take a look at Gremlin.