Tags: regex, filter, sparql, rdf, europeana-api

SPARQL: combine and exclude regex filters


I want to filter my SPARQL query for specific keywords while excluding other keywords at the same time. I thought this could easily be done with FILTER (regex(str(?var),"includedKeyword","i") && !regex(str(?var),"excludedKeyword","i")). The query works without the negated ("!") condition, but not with it. I also tried splitting the two conditions into separate FILTER statements, to no avail.

I used this query on http://europeana.ontotext.com/ :

PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX edm: <http://www.europeana.eu/schemas/edm/>
PREFIX ore: <http://www.openarchives.org/ore/terms/>

SELECT DISTINCT ?CHO
WHERE {
  ?proxy dc:subject ?subject .
  FILTER ( regex(str(?subject),"gemälde","i") && !regex(str(?subject),"Fotografie","i") )
  ?proxy edm:type "IMAGE" .
  ?proxy ore:proxyFor ?CHO .
  ?agg edm:aggregatedCHO ?CHO ; edm:country "germany" .
}

But the first row of the result is always the object titled "Gemäldegalerie", which has a dc:subject of "Fotografie" (the value I want excluded). I think the problem is that one object in the Europeana database can have more than one dc:subject property, so maybe the filter looks at only one of these values and ignores the others.

Any ideas? Would be very thankful!


Solution

  • The problem is that your combined filter tests both conditions against the same binding of ?subject. It therefore succeeds whenever at least one value of ?subject satisfies both conditions. That is almost always the case: the string "Gemäldegalerie", for example, matches your first regex and does not match the second.
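
    To see this per-binding behaviour in isolation, here is a minimal sketch that inlines two made-up subject values for a single hypothetical proxy (<urn:example:proxy1>, not a real Europeana resource) using VALUES, so it does not depend on any data in the endpoint. It assumes a SPARQL 1.1 processor.

    ```sparql
    SELECT ?proxy ?subject
    WHERE {
      # Hypothetical data: one proxy with two dc:subject-like values.
      VALUES (?proxy ?subject) {
        (<urn:example:proxy1> "Gemäldegalerie")
        (<urn:example:proxy1> "Fotografie")
      }
      # The filter is evaluated once per row, so each value is judged on its own.
      FILTER ( regex(str(?subject), "gemälde", "i") && !regex(str(?subject), "Fotografie", "i") )
    }
    ```

    The row with "Fotografie" is dropped, but the row with "Gemäldegalerie" passes both tests, so the proxy still appears in the result even though one of its subjects is the excluded keyword.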

    So for the negative condition, you need to formulate something that checks all possible values of the property, rather than just one particular value. You can do this with SPARQL's NOT EXISTS filter expression, for example like this:

      PREFIX dc: <http://purl.org/dc/elements/1.1/>
      PREFIX edm: <http://www.europeana.eu/schemas/edm/>
      PREFIX ore: <http://www.openarchives.org/ore/terms/>

      SELECT DISTINCT ?CHO
      WHERE {
        ?proxy edm:type "IMAGE" .
        ?proxy ore:proxyFor ?CHO .
        ?agg edm:aggregatedCHO ?CHO ; edm:country "germany" .
        ?proxy dc:subject ?subject .
        FILTER(regex(str(?subject),"gemälde","i"))
        FILTER NOT EXISTS {
          ?proxy dc:subject ?otherSubject .
          FILTER(regex(str(?otherSubject),"Fotografie","i"))
        }
      }
    

    As an aside: since you are doing regular expression checks, and now combining them with a NOT EXISTS operator, this is likely to become very expensive for the query processor quite quickly. You may want to think about alternative ways to formulate your query, for example using the exact subject strings to include or exclude, which eliminates the regex. You could also have a look at non-standard extensions that the SPARQL endpoint might provide: OWLIM, the store on which the Europeana endpoint runs, supports various full-text-search extensions, though I am not sure they are enabled in the Europeana endpoint.
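
    As a rough sketch of the exact-string variant, assuming the subjects are stored as the plain literals "Gemälde" and "Fotografie" (both values are assumptions, so check the actual dc:subject values first):

    ```sparql
    PREFIX dc: <http://purl.org/dc/elements/1.1/>
    PREFIX edm: <http://www.europeana.eu/schemas/edm/>
    PREFIX ore: <http://www.openarchives.org/ore/terms/>

    SELECT DISTINCT ?CHO
    WHERE {
      ?proxy edm:type "IMAGE" .
      ?proxy ore:proxyFor ?CHO .
      ?agg edm:aggregatedCHO ?CHO ; edm:country "germany" .
      # Assumed exact subject value to include; replaces the first regex.
      ?proxy dc:subject "Gemälde" .
      # Assumed exact subject value to exclude; replaces the second regex.
      FILTER NOT EXISTS { ?proxy dc:subject "Fotografie" }
    }
    ```

    Unlike the regex version, this matches only literals that are exactly equal. It will miss values that differ in case, contain the keyword as a substring (such as "Gemäldegalerie"), or carry a language tag (such as "Gemälde"@de), so it only fits if the subject vocabulary is reasonably consistent.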