
SPARQL query for merging RDF Data Cubes


I'm working on a project that stores two RDF Data Cubes:

  • Climate Data Cube : humidity-dataset, rainfall-dataset, temperature-dataset
  • Industry Data Cube : industry-dataset

Both data cubes are stored in a GraphDB database as named graphs. The datasets in both graphs share the same dimensions: time and year. Now I need to merge these datasets for data exploration. Assume we have the observations below, which contain the climate and industry data for Ha Noi city in 2016-2017:

graph : http://sda-research.ml/graph/climate

Dataset-climate

ds:obs5 a qb:Observation;
  qb:dataSet ds:dataset-climate;
  prop:city "Ha Noi"@en;
  prop:cityid "hanoi";
  prop:humidity 8.17E1;
  prop:rainfall 2.1668E3;
  prop:year "2016"^^xsd:int .


ds:obs6 a qb:Observation;
  qb:dataSet ds:dataset-climate;
  prop:city "Ha Noi"@en;
  prop:cityid "hanoi";
  prop:humidity 8.18E1;
  prop:rainfall 2.6402E3;
  prop:year "2017"^^xsd:int .

graph : http://sda-research.ml/graph/industry

Dataset-industry

ds:obs205 a qb:Observation;
  qb:dataSet ds:dataset-industry;
  prop:city "Hà Nội"@en;
  prop:cityid "hanoi";
  prop:industry 1.073E2;
  prop:year "2016"^^xsd:int .

ds:obs206 a qb:Observation;
  qb:dataSet ds:dataset-industry;
  prop:city "Hà Nội"@en;
  prop:cityid "hanoi";
  prop:industry 1.07E2;
  prop:year "2017"^^xsd:int .

Now I want to merge the two graphs so that the output contains the humidity and industry values for Ha Noi in 2016-2017. On the GraphDB SPARQL endpoint, I used this query:

PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX prop: <http://www.sda-research.ml/dc/prop/>
select ?city ?year ?temperature ?industry
where{
     {graph ?g {
            ?obs a qb:Observation. 
            ?obs prop:cityid ?cityid filter regex(?cityid, 'hanoi').
            ?obs prop:city ?city. 
            ?obs prop:year ?year filter(?year >= 2017 && ?year <= 2018 ).
            ?obs prop:temperature ?temperature.
            }
      }
  UNION 
     {graph ?g {
             ?obs a qb:Observation. 
             ?obs prop:cityid ?cityid filter regex(?cityid, 'hanoi').
             ?obs prop:city ?city.
             ?obs prop:year ?year filter(?year >= 2016 && ?year <= 2017).
             ?obs prop:industry ?industry.
             }
      }
}

Expected output:

city------year------humidity------industry---
Ha Noi-----2016-------8.17E1------ 1.073E2---
Ha Noi-----2017-------8.18E1-------1.07E2----

Actual output:

city------year------humidity------industry--
Ha Noi-----2016-------8.17E1--------null----
Ha Noi-----2017-------8.18E1--------null----
Ha Noi-----2016--------null--------1.073E2--
Ha Noi-----2017--------null--------1.07E2---

How can I remove the null values when using UNION, or is there another query that gives the expected result?


Solution

  • There are a few issues with your query to address before we get into the SPARQL itself.

    1. Your sample data contains prop:humidity, but the query asks for prop:temperature, which none of the observations shown have.
    2. The year ranges of the two UNION branches differ: the first branch filters on 2017-2018, the second on 2016-2017, so only 2017 is covered by both. This may be fine in certain cases, but it will not produce the result you expect.

    Now for the SPARQL issues.

    1. You select both ?cityid and ?city, but the value of ?city is spelt differently across the named graphs, namely "Ha Noi"@en in the climate graph and "Hà Nội"@en in the industry graph. Joining on ?city would therefore match nothing; ?cityid is the reliable key.
    2. Your observations are not the same resource across the named graphs, so each graph pattern needs its own observation variable.
    3. You use only one variable, ?g, for your named graphs. This means that the first two of the four results come from the climate graph and the last two from the industry graph. When you know which graph each value should come from, name that graph explicitly.
    4. When you have a specific city in mind, I would avoid REGEX. Different triplestores implement query planning differently, but this is an expensive operation that may significantly worsen performance. See below for how to deal with this by using the values keyword.

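    Now here is a slightly amended query that produces the results you're after. It drops the UNION and places the two graph patterns side by side, so they are joined on the shared variables ?cityid and ?year and each row carries both values: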

        PREFIX qb: <http://purl.org/linked-data/cube#>
        PREFIX prop: <http://www.sda-research.ml/dc/prop/>

        select ?cityid ?year ?humidity ?industry
        where {
          values ?cityid { 'hanoi' }

          graph <http://sda-research.ml/graph/climate> {
            ?obs1 a qb:Observation .
            ?obs1 prop:cityid ?cityid .
            ?obs1 prop:year ?year .
            filter(?year >= 2016 && ?year <= 2017)
            ?obs1 prop:humidity ?humidity .
          }

          graph <http://sda-research.ml/graph/industry> {
            ?obs2 a qb:Observation .
            ?obs2 prop:cityid ?cityid .
            ?obs2 prop:year ?year .
            filter(?year >= 2016 && ?year <= 2017)
            ?obs2 prop:industry ?industry .
          }
        }
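
    The expected output in the question lists the city name rather than ?cityid. As a hedged variant of the query above, the label can be read from just one graph (here the climate graph, which is an arbitrary choice), which sidesteps the differing spellings. The prefixes and graph IRIs are copied from the question; the single trailing FILTER and the ORDER BY are stylistic assumptions, not something the original query requires.

    ```sparql
    PREFIX qb: <http://purl.org/linked-data/cube#>
    PREFIX prop: <http://www.sda-research.ml/dc/prop/>

    SELECT ?city ?year ?humidity ?industry
    WHERE {
      # Exact match on the city key instead of a regex scan.
      VALUES ?cityid { "hanoi" }

      # Climate observations: the city label is taken from this graph only.
      GRAPH <http://sda-research.ml/graph/climate> {
        ?obs1 a qb:Observation ;
              prop:cityid ?cityid ;
              prop:city ?city ;
              prop:year ?year ;
              prop:humidity ?humidity .
      }

      # Industry observations: joined on the shared ?cityid and ?year.
      GRAPH <http://sda-research.ml/graph/industry> {
        ?obs2 a qb:Observation ;
              prop:cityid ?cityid ;
              prop:year ?year ;
              prop:industry ?industry .
      }

      # One filter covers both patterns because ?year is shared.
      FILTER(?year >= 2016 && ?year <= 2017)
    }
    ORDER BY ?year
    ```

    On the four sample observations this should return one row per year, each carrying both the humidity and the industry value. If some years exist in only one graph and should still appear, wrapping the industry block in OPTIONAL { ... } would keep those rows with ?industry left unbound.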