Multiple query on DBpedia endpoint for movie information retrieval using Apache Jena

I'm trying to download some films' information(year of production and title) using Apache Jena and querying the DBpedia public endpoint. I already know that the public endpoint has some security restrictions and for this reason it doesn't grant to use query which return more than 2000 rows in the results set. For this reason, I've tried to subdivide my query in multiple query using the LIMIT and OFFSET option appropriately and with a Java program (http://ideone.com/xF0GCE) I'll save them on a specific file in a formatted manner:

public void movieQuery(String dbpediaFilms) throws IOException {
     String includeNamespaces = "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n" +
      "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n" +
      "PREFIX dcterms: <http://purl.org/dc/terms/>\n" +
      "PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>\n";

       String currQuery = includeNamespaces + "SELECT DISTINCT ?movie (str(?movie_title) as ?title) (str(?movie_year) as ?year) WHERE {\n" +
       " ?movie rdf:type dbpedia-owl:Film.\n" +
       " ?movie rdfs:label ?movie_title.\n" +
" ?movie dcterms:subject ?cat .\n" +
" ?cat rdfs:label ?movie_year .\n" +
" FILTER langMatches(lang(?movie_title), \"EN\") .\n" +
" FILTER regex(?movie_year, \"^[0-9]{4} \", \"i\")\n" +
" } limit 2000 offset ";
      int totalNumberOfFilms = 77794;
      int totNumQuery = 39;
      int offset = 0;
      int currNum = 0;

      for(int i = 1; i <= totNumQuery; i++) {
      try {

         Query query = QueryFactory.create(currQuery + offset);
         currNum += Utils.serializeMappingList(getMovieMappingList(query), dbpediaFilms);

       } catch (Exception ex) {
             ex.printStackTrace();
             throw ex;
        }

      offset += 2000;

      myWait(30);

     }

     System.out.println(currNum);


}

This is the query that I use in order to retrieve the information that I need:

SELECT DISTINCT ?movie (str(?movie_title) as ?title) (str(?movie_year) as ?year) WHERE    { ?movie rdf:type dbpedia-owl:Film.
  ?movie rdfs:label ?movie_title.
  ?movie dcterms:subject ?cat .
  ?cat rdfs:label ?movie_year .
  FILTER langMatches(lang(?movie_title), "EN") .
  FILTER regex(?movie_year, "^[0-9]{4} ", "i")
  } limit 2000 offset $specific_offset

As you can see in the java code, I increments a variable (offset) by 2000 in order to acquire the correct results set partition.

Running a preliminar query, I've seen that the total number of distinct films in DBpedia is 77794 using this query:

select distinct count(?film) where {
?film rdf:type dbpedia-owl:Film.
}

The problem is that if I count the number of obtained nodes, is equal to 76000 so I think that I've missed a lot of films using this procedure. Someone can say to me how can I get correctly the whole results set? Am I forced to query a local DBpedia dump in order to correctly get the results?

Thank you a lot in advance.

EDIT: I've created a new query using the useful suggestion @Joshua Taylor:

SELECT DISTINCT ?movie (str(?movie_year) as ?year) (str(?movie_title) as ?title)  WHERE {
?movie rdf:type dbpedia-owl:Film.
movie rdfs:label ?movie_title.
FILTER langMatches(lang(?movie_title), \"EN\") .
optional { ?movie dbpprop:released   ?rel_year }
optional{?movie dbpedia-owl:releaseDate ?owl_year}
 optional {?movie dcterms:subject ?sub.
 ?sub rdfs:label ?movie_year_sub
 filter regex(?movie_year_sub, ".*[0-9]{4}.*", "i") }
 BIND(COALESCE(?owl_year, ?rel_year, ?movie_year_sub) AS ?movie_year)
 } group by ?movie limit 2000 offset $specific_offset

Using the group by specification, the virtuoso endpoint let me to get the correct results set which doesn't have duplicates row. Instead when I try to run the query using apache Jena, I'm not able to execute it because I receive the following error:

com.hp.hpl.jena.query.QueryParseException: Non-group key variable in SELECT: ?movie_year in expression str(?movie_year)

Solution

There are more films than those that satisfy your original query, and your query doesn't necessarily count each movie just once. There's big difference between select distinct (count(?var) as ?nVar) … and select (count(distinct ?var) as ?nVar) …. The first only shows you distinct counts, whereas the second shows you the number of distinct bindings.

You can get more than one result row for each movie. In this part of your query:

?movie rdf:type dbpedia-owl:Film.
?movie dcterms:subject ?cat .
?cat rdfs:label ?movie_year .
FILTER regex(?movie_year, "^[0-9]{4} ", "i")

you'll get a result row for a each matching label of each category to which the movie belongs. E.g, if some film is in categories 1984's Worst Movies and 2010 Film Remakes, you'll get two result rows.

There are also legitimate films that you won't be counting, because some films might not have an English movie title or a category that begins with a year.

I'm not sure whether you'll be able to get entirely satisfactory results, since it appears that DBpedia just doesn't reliably have the data that you want. That said, try a query like this to get started. It will get all films, and and will (hopefully) pull out enough information to get dates in many cases. Some the dbpprop:released values are very strange though, and I don't know how useful they'll be to you.

select * where { 
  ?film a dbpedia-owl:Film 
  optional { ?film dbpprop:released        ?released    }
  optional { ?film dbpedia-owl:releaseDate ?releaseDate }
  optional { ?film dcterms:subject [ rdfs:label ?catLabel ]
             filter( regex( ?catLabel, "^[0-9]{4}.*films", "i" ) )
           }
}
order by ?film 
limit 100

SPARQL results

Update after new query

The query that you've posted that doesn't work with Jena (because it's not legal SPARQL, even though Virtuoso accepts it) can be fixed in a few different ways, depending on what exactly you want. The simplest, most direct way, is simply not to group on anything.

SELECT DISTINCT ?movie (str(?movie_year) as ?year) (str(?movie_title) as ?title)
WHERE {
  ?movie rdf:type dbpedia-owl:Film.
  ?movie rdfs:label ?movie_title.
  FILTER langMatches(lang(?movie_title), 'en')
  optional { ?movie dbpprop:released   ?rel_year }
  optional { ?movie dbpedia-owl:releaseDate ?owl_year}
  optional { ?movie dcterms:subject ?sub.
             ?sub rdfs:label ?movie_year_sub
             filter regex(?movie_year_sub, ".*[0-9]{4}.*", "i")
           }
  BIND(COALESCE(?owl_year, ?rel_year, ?movie_year_sub) AS ?movie_year)
}
limit 2000

If you do that, though, you'll get multiple results when you have multiple English movie titles, release years, etc. If you want to avoid that, then you want to group by ?movie. Jena's right to reject things like

select ?movie (str(?movie_title) as ?title) where { 
  ?movie :hasTitle ?movie_title
}
group by ?movie

because str(?movie_title) doesn't make sense. For each ?movie, you've actually a set of ?movie_titles. You need to get a representative title from that set. Now, it doesn't actually look like any movie has more than one English title. You can check with a query like:

SELECT ?movie (count(?mTitle) as ?nTitles)
WHERE {
  ?movie a dbpedia-owl:Film ;
         rdfs:label ?mTitle .
  filter langMatches(lang(?mTitle),'en')
}
group by ?movie
having count(?mTitle) > 1

Given that, it means that you can safely group by ?movie ?movie_title which will let you use ?movie_title in the projection variable list. But what to do about the release date? You could still end up with more than one of those, in principle. The data does give you more than one, in fact, as you can see with this query:

SELECT DISTINCT ?movie (group_concat(?movie_year;separator=';') as ?years)
WHERE {
  ?movie rdf:type dbpedia-owl:Film.
  ?movie rdfs:label ?movie_title.
  FILTER langMatches(lang(?movie_title), 'en')
  optional { ?movie dbpprop:released   ?rel_year }
  optional { ?movie dbpedia-owl:releaseDate ?owl_year}
  optional { ?movie dcterms:subject ?sub.
             ?sub rdfs:label ?movie_year_sub
             filter regex(?movie_year_sub, ".*[0-9]{4}.*", "i")
           }
  BIND(COALESCE(?owl_year, ?rel_year, ?movie_year_sub) AS ?movie_year)
}
group by ?movie ?movie_title 
having count(?movie_year) > 1
limit 2000

This means that you'll need to get a value based on that set. SPARQL gives you a few functions to do that (e.g., max, min, sum). In this case, I don't know if there's an easy way to pick the "best" representative, so you might just want to sample from it, giving you a query like this:

SELECT DISTINCT ?movie (str(sample(?movie_year)) as ?year) ?movie_title
WHERE {
  ?movie rdf:type dbpedia-owl:Film.
  ?movie rdfs:label ?movie_title.
  FILTER langMatches(lang(?movie_title), 'en')
  optional { ?movie dbpprop:released   ?rel_year }
  optional { ?movie dbpedia-owl:releaseDate ?owl_year}
  optional { ?movie dcterms:subject ?sub.
             ?sub rdfs:label ?movie_year_sub
             filter regex(?movie_year_sub, ".*[0-9]{4}.*", "i")
           }
  BIND(COALESCE(?owl_year, ?rel_year, ?movie_year_sub) AS ?movie_year)
}
group by ?movie ?movie_title 
limit 2000

SPARQL results

This is legal SPARQL, confirmed by the sparql.org validator (once you provide some prefix definitions), so Jena should be fine with it, and Virtuoso (in this case, the DBpedia endpoint) accepts it, too.