sparql wikidata

How to adjust a SPARQL query to return results even when some information is missing


I am new on this side, the question-asking side, so please tell me if you need any additional information.

I have a dataset with 2900 entries, consisting mostly of Dutch and Flemish poets. I want to add information to this dataframe by querying Wikidata: gender, nationality, date of birth, and date of death. Now how many poets can two small countries have? Not all of them can be found on Wikidata (I'm going to take care of that later), and for those that can, the information is sometimes very scarce.

I have used the following query:

import requests

def get_data_for_poet(poet):
    url = 'https://query.wikidata.org/sparql'
    query = '''
    prefix schema: <http://schema.org/>
            SELECT ?item ?occupation ?genderLabel ?bdayLabel ?bnatLabel ?deathLabel
            WHERE {
                ?item ?label "''' + poet + '''"@en.
                ?item wdt:P106 ?occupation .
                ?item wdt:P21 ?gender .
                ?item wdt:P569 ?bday .
                ?item wdt:P27 ?bnat .
                ?item wdt:P570 ?death .

            SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
        }
'''

    r = requests.get(url, params={'format': 'json', 'query': query})
    try:
        #print(r.content)
        data = r.json()
        return {
            'gender': data['results']['bindings'][0]['genderLabel']['value'],
            'birthday': data['results']['bindings'][0]['bdayLabel']['value'],
            'death': data['results']['bindings'][0]['deathLabel']['value'],
            'nationality': data['results']['bindings'][0]['bnatLabel']['value'],
        }
    except (ValueError, KeyError, IndexError):
        # Invalid JSON, no matching item, or a missing field:
        # fall back to 'Onbekend' (unknown) for everything.
        return {
            'gender': 'Onbekend',
            'birthday': 'Onbekend',
            'death': 'Onbekend',
            'nationality': 'Onbekend'
        }

Then I ran the following code for the first 20 entries:

import math
import time

import pandas as pd

poets = {}  # cache of query results, keyed by poet name

def get_poet_metadata_for_row(row):
    f = math.floor(row['index']/80) # the API returns errors 472 if it goes any faster
    print(row['index'])
    time.sleep(1+f)
    poet = row['Dichter']

    if poet == 'Onbekend':
        return pd.Series(['Onbekend', 'Onbekend', 'Onbekend', 'Onbekend'])

    data = get_data_for_poet(poet)
    
    print(data)
    
    poets[poet] = data
    
    return pd.Series([data['birthday'],data['nationality'],data['gender'],data['death']])


df[['Geboortedatum','Nationaliteit', 'Geslacht', 'Gestorven']] = df[:20].apply(get_poet_metadata_for_row, axis=1)

But unfortunately, I noticed that the query only returns information when all four pieces of information are available for a Q ID.

This is a piece of the output:

12
{'gender': 'male', 'birthday': '1934-08-04T00:00:00Z', 'death': '2012-07-11T00:00:00Z', 'nationality': 'Kingdom of the Netherlands'}
13
{'gender': 'Onbekend', 'birthday': 'Onbekend', 'death': 'Onbekend', 'nationality': 'Onbekend'}
14
{'gender': 'Onbekend', 'birthday': 'Onbekend', 'death': 'Onbekend', 'nationality': 'Onbekend'}
15
{'gender': 'Onbekend', 'birthday': 'Onbekend', 'death': 'Onbekend', 'nationality': 'Onbekend'}
16
{'gender': 'Onbekend', 'birthday': 'Onbekend', 'death': 'Onbekend', 'nationality': 'Onbekend'}

I then tried to query this information one piece at a time (first gender, then birthday, etc.), but that takes forever.

How can I adjust the query so that all information is returned, even if, let's say, only gender is known? I tried some things with OPTIONAL but it's getting messy real fast. I am new to SPARQL, so any help is appreciated.

Also, I may suffer from tunnel vision considering the time I have spent on this dataset, but if there is a Python package that can do exactly this I would love to know.


Solution

  • Your intuition about OPTIONAL is correct. A plain triple pattern in the WHERE clause is mandatory, so an item that lacks any one of the four properties is dropped from the results entirely. Wrap each piece of information that may be missing in its own OPTIONAL block.

    Furthermore, to avoid false positives, I think you should also use rdfs:label instead of a generic ?label variable (which can match any property).

    PREFIX schema: <http://schema.org/>
    SELECT ?item ?occupation ?genderLabel ?bdayLabel ?bnatLabel ?deathLabel
    WHERE {
        ?item rdfs:label "Marc Tritsmans"@en.
        ?item wdt:P106 ?occupation .
        OPTIONAL { ?item wdt:P21 ?gender . }
        OPTIONAL { ?item wdt:P569 ?bday . }
        OPTIONAL { ?item wdt:P27 ?bnat . }
        OPTIONAL { ?item wdt:P570 ?death . }
        SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
    }
    

    See a demo here.
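
To plug this back into the Python side, here is a minimal sketch, not a definitive implementation. The names build_query, extract_fields, FIELDS and UNKNOWN are made up for illustration. It relies on the fact that the SPARQL JSON results format simply omits a variable from a binding when an OPTIONAL block did not match, so each field needs its own fallback instead of one try/except around all four.

```python
# Hypothetical helper names; only the query itself comes from the answer above.
FIELDS = {
    "gender": "genderLabel",
    "birthday": "bdayLabel",
    "death": "deathLabel",
    "nationality": "bnatLabel",
}
UNKNOWN = "Onbekend"


def build_query(poet):
    """Return the OPTIONAL-based query for one poet name."""
    # Escape backslashes and double quotes so the name stays a valid
    # SPARQL string literal (a name containing a quote would otherwise break it).
    safe = poet.replace("\\", "\\\\").replace('"', '\\"')
    return """
    SELECT ?item ?occupation ?genderLabel ?bdayLabel ?bnatLabel ?deathLabel
    WHERE {
        ?item rdfs:label "%s"@en .
        ?item wdt:P106 ?occupation .
        OPTIONAL { ?item wdt:P21 ?gender . }
        OPTIONAL { ?item wdt:P569 ?bday . }
        OPTIONAL { ?item wdt:P27 ?bnat . }
        OPTIONAL { ?item wdt:P570 ?death . }
        SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
    }
    """ % safe


def extract_fields(data):
    """Read the first result row, filling in UNKNOWN per missing field."""
    bindings = data.get("results", {}).get("bindings", [])
    first = bindings[0] if bindings else {}
    # Unbound variables are absent from the row, so look each one up separately.
    return {
        key: first.get(var, {}).get("value", UNKNOWN)
        for key, var in FIELDS.items()
    }
```

In get_data_for_poet you would then pass build_query(poet) as the 'query' parameter and return extract_fields(r.json()). A poet for whom only the gender is known then yields that gender plus 'Onbekend' for the other three fields.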