
Extract all parents of a given node


I'm trying to extract all parents of a given GO ID (a node) using the EBI RDF SPARQL endpoint. I based my query on two similar questions. Here are two examples illustrating the problem:

Example 1 (Link to the structure):

biological_process (GO:0008150)
           |__ metabolic process (GO:0008152)
                           |__ methylation (GO:0032259)

In this example, using the following query:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX dbpedia2: <http://dbpedia.org/property/>
PREFIX dbpedia: <http://dbpedia.org/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

PREFIX obo: <http://purl.obolibrary.org/obo/>

SELECT (count(?mid) as ?depth)
       (group_concat(distinct ?midId ; separator = " / ") AS ?treePath) 
FROM <http://rdf.ebi.ac.uk/dataset/go> 
WHERE {
    obo:GO_0032259 rdfs:subClassOf* ?mid .
    ?mid rdfs:subClassOf* ?class .
    ?mid <http://www.geneontology.org/formats/oboInOwl#id> ?midId.
}
GROUP BY ?treePath
ORDER BY ?depth

I got the desired results without problems:

depth |              treePath
------|-------------------------------------
6     | GO:0008150 / GO:0008152 / GO:0032259

But when the term exists in multiple branches (e.g. GO:0007267), as in the case below, the previous approach doesn't work:

Example 2 (Link to the structure)

biological_process (GO:0008150)
           |__ cellular_process (GO:0009987)
           |           |__ cell communication (GO:0007154)
           |                       |__ cell-cell signaling (GO:0007267)
           |
           |__ signaling (GO:0023052)
                      |__ cell-cell signaling (GO:0007267)

The result:

depth |                            treePath
------|---------------------------------------------------------------
15    | GO:0007154 / GO:0007267 / GO:0008150 / GO:0009987 / GO:0023052

What I wanted to get is the following:

GO:0008150 / GO:0009987 / GO:0007154 / GO:0007267
GO:0008150 / GO:0023052 / GO:0007267
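
For comparison, here is a minimal Python sketch of the output I'm after, assuming the direct subClassOf edges have already been fetched into a dict. The `PARENTS` map and both function names are made up for illustration; the edges are copied from the two diagrams above, not queried from the endpoint.

```python
# Direct parents (rdfs:subClassOf edges) copied from the two diagrams above.
# This map is an illustrative assumption, not data fetched from the endpoint.
PARENTS = {
    "GO:0032259": ["GO:0008152"],
    "GO:0008152": ["GO:0008150"],
    "GO:0007267": ["GO:0007154", "GO:0023052"],
    "GO:0007154": ["GO:0009987"],
    "GO:0009987": ["GO:0008150"],
    "GO:0023052": ["GO:0008150"],
    "GO:0008150": [],  # root: biological_process
}


def paths_to_root(term, parents):
    """Return every path from a root down to `term`, one list per branch."""
    direct = parents.get(term, [])
    if not direct:
        return [[term]]
    paths = []
    for parent in direct:
        for path in paths_to_root(parent, parents):
            paths.append(path + [term])
    return paths


def format_paths(term, parents=PARENTS):
    """Render each branch in the 'A / B / C' layout used in this question."""
    return [" / ".join(path) for path in paths_to_root(term, parents)]


for line in format_paths("GO:0007267"):
    print(line)
```

The point of the sketch is that each branch is built by following one parent at a time, so two branches never get merged into a single list.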

As I understand it, the query computes the depth of each ancestor and uses that depth to order the path. This works fine when the term belongs to only one branch:

SELECT (count(?mid) as ?depth) ?midId
FROM <http://rdf.ebi.ac.uk/dataset/go> 
WHERE {
    obo:GO_0032259 rdfs:subClassOf* ?mid .
    ?mid rdfs:subClassOf* ?class .
    ?mid <http://www.geneontology.org/formats/oboInOwl#id> ?midId.
}
GROUP BY ?midId
ORDER BY ?depth

The result:

depth |   midId
------|------------
1     | GO:0008150
2     | GO:0008152
3     | GO:0032259

In the second example, things are messed up and I don't understand why. I'm fairly sure that part of the problem is that several terms share the same depth/level, but I don't know how to solve this. The same query with obo:GO_0007267 gives:

depth |   midId
------|------------
2     | GO:0008150
2     | GO:0009987
2     | GO:0023052
3     | GO:0007154
6     | GO:0007267
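
One reading that reproduces both tables is that the count grows with the number of upward paths, not the number of distinct ancestors. The Python sketch below simulates the two `rdfs:subClassOf*` patterns under that assumption: every distinct path contributes one row. This is only a guess about how the endpoint evaluates the query (as far as I know, standard SPARQL 1.1 property paths return distinct nodes); `PARENTS` and the function names are hypothetical, and the edges are copied from the diagrams.

```python
from collections import Counter

# Direct parents copied from the two diagrams in the question (illustrative).
PARENTS = {
    "GO:0032259": ["GO:0008152"],
    "GO:0008152": ["GO:0008150"],
    "GO:0007267": ["GO:0007154", "GO:0023052"],
    "GO:0007154": ["GO:0009987"],
    "GO:0009987": ["GO:0008150"],
    "GO:0023052": ["GO:0008150"],
    "GO:0008150": [],
}


def upward_walks(term, parents):
    """Yield the end node of every upward path from `term`.

    The zero-length path is included, and a node reachable along two
    branches is yielded twice (the per-path assumption).
    """
    yield term
    for parent in parents.get(term, []):
        yield from upward_walks(parent, parents)


def depth_counts(term, parents=PARENTS):
    """Mimic count(?mid) grouped by ?midId under the per-path assumption."""
    counts = Counter()
    for mid in upward_walks(term, parents):      # term subClassOf* ?mid
        for _cls in upward_walks(mid, parents):  # ?mid subClassOf* ?class
            counts[mid] += 1
    return dict(counts)


for term in ("GO:0032259", "GO:0007267"):
    print(term, sorted(depth_counts(term).items(), key=lambda kv: kv[1]))
```

Under this assumption the numbers match both tables, including the 6 for GO:0007267 and the 2 for GO:0008150. Either way, a single number per term can only give one linear ordering, so it cannot separate the two branches.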

Solution

  • Thanks to @AKSW I found a decent solution using HyperGraphQL (a GraphQL interface for querying and serving linked data on the Web).

    I'll leave the detailed answer here in case it helps someone.

    1. I downloaded and set up HyperGraphQL from its download page.
    2. I linked it to the EBI SPARQL endpoint as described in this tutorial.

      The config.json file I used:

      {
          "name": "ebi-hgql",
          "schema": "ebischema.graphql",
          "server": {
              "port": 8081,
              "graphql": "/graphql",
              "graphiql": "/graphiql"
          },
          "services": [
              {
                  "id": "ebi-sparql",
                  "type": "SPARQLEndpointService",
                  "url": "http://www.ebi.ac.uk/rdf/services/sparql",
                  "graph": "http://rdf.ebi.ac.uk/dataset/go",
                  "user": "",
                  "password": ""
              }
          ]
      }
      

      Here's what my ebischema.graphql file looks like (I only needed Class, id, label and subClassOf):

      type __Context {
          Class:          _@href(iri: "http://www.w3.org/2002/07/owl#Class")
          id:             _@href(iri: "http://www.geneontology.org/formats/oboInOwl#id")
          label:          _@href(iri: "http://www.w3.org/2000/01/rdf-schema#label")
          subClassOf:     _@href(iri: "http://www.w3.org/2000/01/rdf-schema#subClassOf")
      }
      
      type Class @service(id:"ebi-sparql") {
          id: [String] @service(id:"ebi-sparql")
          label: [String] @service(id:"ebi-sparql")
          subClassOf: [Class] @service(id:"ebi-sparql")
      }
      
    3. I started testing some simple queries but kept getting an empty response; the answer to this issue solved my problem.

    4. Finally I constructed the query to get the tree

      Using this query:

      {
        Class_GET_BY_ID(uris:[
          "http://purl.obolibrary.org/obo/GO_0032259",
          "http://purl.obolibrary.org/obo/GO_0007267"]) {
          id
          label
          subClassOf {
            id
            label
            subClassOf {
              id
              label
            }
          }
        }
      }
      

      I got some interesting results:

      {
        "extensions": {},
        "data": {
          "@context": {
            "_type": "@type",
            "_id": "@id",
            "id": "http://www.geneontology.org/formats/oboInOwl#id",
            "label": "http://www.w3.org/2000/01/rdf-schema#label",
            "Class_GET_BY_ID": "http://hypergraphql.org/query/Class_GET_BY_ID",
            "subClassOf": "http://www.w3.org/2000/01/rdf-schema#subClassOf"
          },
          "Class_GET_BY_ID": [
            {
              "id": [
                "GO:0032259"
              ],
              "label": [
                "methylation"
              ],
              "subClassOf": [
                {
                  "id": [
                    "GO:0008152"
                  ],
                  "label": [
                    "metabolic process"
                  ],
                  "subClassOf": [
                    {
                      "id": [
                        "GO:0008150"
                      ],
                      "label": [
                        "biological_process"
                      ]
                    }
                  ]
                }
              ]
            },
            {
              "id": [
                "GO:0007267"
              ],
              "label": [
                "cell-cell signaling"
              ],
              "subClassOf": [
                {
                  "id": [
                    "GO:0007154"
                  ],
                  "label": [
                    "cell communication"
                  ],
                  "subClassOf": [
                    {
                      "id": [
                        "GO:0009987"
                      ],
                      "label": [
                        "cellular process"
                      ]
                    }
                  ]
                },
                {
                  "id": [
                    "GO:0023052"
                  ],
                  "label": [
                    "signaling"
                  ],
                  "subClassOf": [
                    {
                      "id": [
                        "GO:0008150"
                      ],
                      "label": [
                        "biological_process"
                      ]
                    }
                  ]
                }
              ]
            }
          ]
        },
        "errors": []
      }
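
To turn a response like the one above into the "A / B / C" paths from the question, the nested `subClassOf` arrays can be walked recursively. This is a minimal Python sketch, assuming the response has already been parsed into a dict; `RESPONSE` is a trimmed copy of the JSON above (labels and `@context` left out), and `branches` and `tree_paths` are names I made up.

```python
# Trimmed copy of the HyperGraphQL response above (ids only).
RESPONSE = {
    "data": {
        "Class_GET_BY_ID": [
            {
                "id": ["GO:0032259"],
                "subClassOf": [
                    {"id": ["GO:0008152"], "subClassOf": [{"id": ["GO:0008150"]}]}
                ],
            },
            {
                "id": ["GO:0007267"],
                "subClassOf": [
                    {"id": ["GO:0007154"], "subClassOf": [{"id": ["GO:0009987"]}]},
                    {"id": ["GO:0023052"], "subClassOf": [{"id": ["GO:0008150"]}]},
                ],
            },
        ]
    }
}


def branches(node):
    """Return one root-first list of ids per branch found under `node`."""
    node_id = node["id"][0]
    parents = node.get("subClassOf", [])
    if not parents:
        return [[node_id]]
    return [path + [node_id] for parent in parents for path in branches(parent)]


def tree_paths(response):
    """Map each requested term id to its ' / '-joined branches."""
    result = {}
    for term in response["data"]["Class_GET_BY_ID"]:
        result[term["id"][0]] = [" / ".join(p) for p in branches(term)]
    return result


for term_id, paths in tree_paths(RESPONSE).items():
    for path in paths:
        print(path)
```

Note that the branch through GO:0007154 stops at GO:0009987, because the query only asks for two levels of `subClassOf`; the walk can only return what the response contains.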
      

    EDIT

    This was exactly what I wanted, but I noticed that I can't add another sublevel like this:

    {
      Class_GET_BY_ID(uris:[
        "http://purl.obolibrary.org/obo/GO_0032259",
        "http://purl.obolibrary.org/obo/GO_0007267"]) {
        id
        label
        subClassOf {
          id
          label
          subClassOf {
            id
            label
            subClassOf {  # <--- 4th sublevel
              id
              label
            }
          }
        }
      }
    }
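
Since each extra level has to be written out by hand, the query text can also be generated. This is a small Python sketch that only builds the query string for a chosen nesting depth; `nested_fields` and `build_query` are hypothetical helpers, and generating the text does not by itself address the problem I hit when adding the extra sublevel.

```python
def nested_fields(levels, indent=2):
    """Return a selection set with `levels` nested subClassOf blocks."""
    pad = " " * indent
    lines = [f"{pad}id", f"{pad}label"]
    if levels > 0:
        lines.append(f"{pad}subClassOf {{")
        lines.append(nested_fields(levels - 1, indent + 2))
        lines.append(f"{pad}}}")
    return "\n".join(lines)


def build_query(uris, levels):
    """Build a Class_GET_BY_ID query string for the given IRIs and depth."""
    uri_list = ", ".join(f'"{u}"' for u in uris)
    return (
        "{\n  Class_GET_BY_ID(uris:[" + uri_list + "]) {\n"
        + nested_fields(levels, 4)
        + "\n  }\n}"
    )


print(build_query(["http://purl.obolibrary.org/obo/GO_0007267"], 3))
```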
    

    I created a new question: Endpoint returned Content-Type: text/html which is not recognized for SELECT queries