Parsing SPARQL results to obtain hostname


I have a huge list of triples like this:

?s ex:url ?url

Where ?url can be:

www.ex.com/data/1.html
www.ex.com/data/2.html
www.google.com/search
...

Is it possible, within a SPARQL query, to extract the host part of each URL and obtain the distinct list of domains? For the example above, that would be www.ex.com and www.google.com.

Something like this:

SELECT distinct ?url
WHERE { ?s ex:url ?url }

But each ?url binding would need to be reduced to its domain. Of course, I could fetch them all and process each URL one by one in my program, but I suppose a SPARQL query would be more memory efficient. I am using Stardog, in case it has some custom functionality.


Solution

  • Use REPLACE with a regular expression:

    BIND(REPLACE(STR(?url), "^(.*?)/.*", "$1") AS ?domain)
    

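    Edit: As @JoshuaTailor noted in the comments, STRBEFORE is simpler and avoids a regular expression altogether when there is no scheme in ?url (note that it returns an empty string if ?url contains no slash):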

    BIND(STRBEFORE(?url, "/") AS ?domain)
    
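    For the data in the question, the STRBEFORE variant can be dropped into a complete query. The following is a minimal sketch: the ex: prefix IRI is a placeholder assumption, and STR() is added only as a guard in case some values are stored as IRIs rather than plain literals.

    ```sparql
    # Placeholder namespace (assumption); replace with the real ex: IRI.
    PREFIX ex: <http://example.org/>

    SELECT DISTINCT ?domain
    WHERE {
      ?s ex:url ?url .
      # Keep everything before the first "/"; yields "" if there is no "/".
      BIND(STRBEFORE(STR(?url), "/") AS ?domain)
    }
    ORDER BY ?domain
    ```

    For the three sample URLs in the question, this should return two rows: www.ex.com and www.google.com.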

    If ?url may start with an http or https scheme, this variant strips the scheme as well:

    BIND(REPLACE(STR(?url), "^(https?://)?(.*?)/.*", "$2") AS ?domain)
    

    Of course, the above only works for basic http(s) URLs, and the regex becomes somewhat more complex if arbitrary URLs need to be handled.

    Here's one that handles any scheme (or none), a port number, auth info, and a missing trailing slash:

    BIND(REPLACE(STR(?url), "^(?:.*?://)?(?:.*?@)?([^:]+?)(:\\d+)?((/.*)|$)", "$1") AS ?domain)
    

    Note that queries that apply a regular expression to every binding can be quite slow on a large dataset.
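
    Because the general pattern is easy to get wrong, it can be checked against a few inline values before running it over the full dataset. The sketch below uses made-up sample URLs (assumptions, not taken from the data in the question) and assumes the engine's regex flavor accepts non-capturing groups and lazy quantifiers.

    ```sparql
    SELECT ?url ?domain
    WHERE {
      # Hypothetical sample inputs: no scheme, https, auth info plus port,
      # and a port with no trailing slash.
      VALUES ?url {
        "www.ex.com/data/1.html"
        "https://www.google.com/search"
        "ftp://user:pw@files.ex.com:2121/pub"
        "http://localhost:8080"
      }
      # Same pattern as above; "$1" keeps only the captured host.
      BIND(REPLACE(STR(?url),
                   "^(?:.*?://)?(?:.*?@)?([^:]+?)(:\\d+)?((/.*)|$)",
                   "$1") AS ?domain)
    }
    ```

    Each row should show only the host in ?domain: www.ex.com, www.google.com, files.ex.com, and localhost. A check over VALUES touches no stored triples, so it is cheap to run before applying the pattern to the real data.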