
How to remove unreadable parts from a SPARQL query result?


Query

    select distinct ?abstract where {
      [ rdfs:label "Rome"@en ;
        dbpedia-owl:abstract ?abstract ]
      filter langMatches(lang(?abstract),'en')
    }

Output:

Rome (/ˈroʊm/; Italian: Roma pronounced [ˈroːma] ; Latin: Rōma) is a city and special comune (named "Roma Capitale") in Italy…

How can I remove "(/ˈroʊm/; Italian: Roma pronounced [ˈroːma] ; Latin: Rōma)", which contains unreadable characters (i.e., the pronunciation guide)?

I got the query from the link.


Solution

  • You could use a query like this to remove text in parentheses:

    select ?abstract ?cleanAbstract where {
      values ?x { dbpedia:Rome }
    
      ?x dbpedia-owl:abstract ?abstract
      filter langMatches(lang(?abstract),'en')
    
      bind( replace( str(?abstract), '\\([^)]*\\)', "" ) as ?cleanAbstract )
    }
    

    SPARQL results

    ?abstract: Rome (/ˈroʊm/; Italian: Roma pronounced [ˈroːma] ; Latin: Rōma) is a city and special comune (named "Roma Capitale") in Italy. Rome is the capital of Italy and also of the Province of Rome and of the region of Lazio. With 2.7 million residents in 1,285.3 km2 (496.3 sq mi), it is also the country's largest …

    ?cleanAbstract: Rome is a city and special comune in Italy. Rome is the capital of Italy and also of the Province of Rome and of the region of Lazio. With 2.7 million residents in 1,285.3 km2 , it is also the country's largest …

    Of course, pronunciations are not the only thing found in parentheses; the area in square miles, for example, was also given in parentheses and is removed as well. However, if the abstracts follow the general convention that text in parentheses can be removed without altering the essential content of the text, this might work for you. You can improve the regular expression to handle the spaces around the parentheses better (note the stray space left in "km2 ,"), or to remove only those parentheses that contain some "non-typical" characters, if you can define what those are.
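
    As a rough sketch of those two refinements, the query below tries both. Neither variant comes from the results linked above. The variable names ?noParens and ?noPronunciation are only illustrative, and the "/" test in the second variant is an assumed heuristic (IPA transcriptions such as /ˈroʊm/ are written between slashes), not a reliable definition of "non-typical" characters. It also assumes the endpoint predefines the same prefixes as the query above.

    select ?noParens ?noPronunciation where {
      values ?x { dbpedia:Rome }

      ?x dbpedia-owl:abstract ?abstract
      filter langMatches(lang(?abstract),'en')

      # Variant 1: remove every parenthesized group together with the
      # whitespace in front of it, so no stray space is left behind.
      bind( replace( str(?abstract), '\\s*\\([^)]*\\)', "" ) as ?noParens )

      # Variant 2 (assumed heuristic): remove only the groups that
      # contain a "/", leaving other parenthetical text in place.
      bind( replace( str(?abstract), '\\s*\\([^)]*/[^)]*\\)', "" ) as ?noPronunciation )
    }

    With the Rome abstract, the first variant should drop all three parenthesized groups, while the second should drop only the pronunciation guide and keep (named "Roma Capitale") and (496.3 sq mi). Neither pattern handles nested parentheses, since each match ends at the first closing parenthesis.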