rdf sparql dbpedia

Select DBpedia resources with at least N occurrences of a selected word in the abstract?


I have this query, which returns some DBpedia resources and their abstracts. How can I filter the results to get only the resources whose abstracts contain at least a certain number of occurrences of a particular word?

PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> 
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

select distinct ?resource ?url ?resume where {
   ?resource rdfs:label ?Nom.
   ?resource foaf:isPrimaryTopicOf ?url.
   ?resource dbo:abstract ?resume.
   FILTER langMatches( lang(?Nom), "EN" )
   FILTER langMatches( lang(?resume), "EN" )
   ?Nom <bif:contains> "apple".             
}  

This is the new query, without the BIND function:

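select ?resource ?url (strlen(replace(replace(lcase(?resume), 'jobs', '_'), '[^_]', '')) as ?nbr)
where {
   ?resource rdfs:label ?Nom.
   ?resource foaf:isPrimaryTopicOf ?url.
   ?resource dbo:abstract ?resume.
   FILTER langMatches( lang(?Nom), "EN" )
   FILTER langMatches( lang(?resume), "EN" )
   ?Nom <bif:contains> "Apple".
   # Without BIND, the count expression is repeated in the FILTER;
   # GROUP BY / HAVING apply to aggregates and are not needed here.
   FILTER (strlen(replace(replace(lcase(?resume), 'jobs', '_'), '[^_]', '')) >= 1)
}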

Solution

  • This won't be perfect, but it should work reasonably well for what you're trying to do. Use replace to substitute a single character (e.g., '_') for every instance of the word you want to count, then use replace again to delete everything except that character. That leaves a string like '______', whose length is the number of times the word appeared. For instance, here's a query that counts occurrences of 'the' (lowercase, with whitespace on both sides) in the abstract and keeps only the resources where it appears at least five times.

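    select ?x ?nThe {
      values ?x { dbr:Horse dbr:Cat dbr:Dog }
      ?x dbo:abstract ?abs .
      filter langMatches(lang(?abs),'en')
      # inner replace: each whitespace-delimited 'the' becomes '_'
      # outer replace: every character other than '_' is removed
      bind(strlen(replace(replace(?abs, '\\sthe\\s', '_'),'[^_]', '')) as ?nThe)
      filter (?nThe >= 5)
    }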
    

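To make the counting trick concrete, the two replace steps can be reproduced outside SPARQL. The following is a minimal Python sketch, not part of the original answer; the function name count_word and the sample strings are made up for illustration. It also shows two ways the count can drift, which may be what "won't be perfect" refers to: substrings inside longer words are counted unless the pattern requires delimiters (as '\\sthe\\s' does in the query), and any marker character already present in the text is counted as well.

```python
import re


def count_word(text: str, word: str, marker: str = "_") -> int:
    """Count occurrences of `word` with the two-step replace trick (illustrative helper)."""
    # Step 1: replace every occurrence of the word with a single marker character.
    marked = re.sub(re.escape(word), marker, text)
    # Step 2: delete every character that is not the marker.
    only_markers = re.sub("[^" + re.escape(marker) + "]", "", marked)
    # One marker remains per occurrence, so the length is the count.
    return len(only_markers)


# Plain case: three occurrences of 'the'.
print(count_word("the cat and the dog saw the bird", "the"))  # 3

# Overcount 1: 'other' contains 'the' as a substring.
print(count_word("the other", "the"))  # 2

# Overcount 2: the text already contains the marker character.
print(count_word("a_b the", "the"))  # 2
```

Picking a marker character that cannot occur in the abstracts avoids the second overcount; the first is what the whitespace delimiters in the query's pattern guard against.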