Tags: regex, sparql, virtuoso, persian

SPARQL regex doesn't match Persian characters with the "i" flag


I expect the case-insensitive "i" flag to only increase the number of matches, never to decrease it. However, the following SPARQL query (endpoint http://www.snik.eu/sparql) returns one match without the flag and none with it:

select * { ?s rdfs:label ?l. filter(regex(str(?l),"قانون بیمارستان")) }

-> 1 match

select * { ?s rdfs:label ?l. filter(regex(str(?l),"قانون بیمارستان","i")) }

-> no match

With non-Persian letters it works as expected:

select count(*) { ?s rdfs:label ?l.filter(regex(str(?l),"Information"))}

-> 319 matches

select count(*) { ?s rdfs:label ?l.filter(regex(str(?l),"Information","i"))}

-> 363 matches

What is the reason for this behaviour and how can I change it to behave as expected?

Virtuoso version 07.20.3217 on Linux (x86_64-unknown-linux-gnu), Single Server Edition

P.S.: The problem still persists after an upgrade to 07.20.3229.

The problem also occurs on DBpedia, which has the same version right now:

select *
{
  <http://dbpedia.org/resource/Persian_language> dbo:abstract ?l.    
  filter(regex(str(?l),"فارسی","i")).
}

Solution

  • I found an open issue about this problem in the Virtuoso GitHub repository: https://github.com/openlink/virtuoso-opensource/issues/705. It seems to be under investigation.

    Thanks to all the commenters for helping with the investigation and for giving great workarounds and alternatives.
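
Until the issue is resolved, one possible workaround is to avoid the "i" flag altogether. Persian script has no upper or lower case, so for a purely Persian pattern the flag-free query from the question should already return every match. For mixed-script search terms, a minimal sketch using the standard SPARQL 1.1 functions lcase() and contains() instead of regex() might look like the query below. It assumes that these functions are not affected by the same problem on Virtuoso, which I have not verified, and it performs a plain substring match, so regex metacharacters in the search term are not interpreted.

```sparql
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Workaround sketch: case-insensitive substring match without regex().
# Assumption: lcase() and contains() handle Persian text correctly on the
# affected Virtuoso versions; this has not been verified.
select * {
  ?s rdfs:label ?l .
  # Lowercase both sides so that Latin letters compare case-insensitively;
  # caseless scripts such as Persian should pass through unchanged.
  filter(contains(lcase(str(?l)), lcase("قانون بیمارستان")))
}
```

If real regular expression syntax is needed, the same idea can be tried with regex(lcase(str(?l)), "...") and a lowercase pattern without the "i" flag.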