Wikidata: Get all non-classical Musicians via SPARQL query


I hope this kind of question is allowed here, as it is more of a Wikidata-specific question. Anyway, I am trying to get all non-classical musicians from Wikidata via SPARQL. Right now I have this query:

SELECT ?value ?valueLabel ?born WHERE {
  {
    SELECT DISTINCT ?value ?born WHERE {
      ?value wdt:P31 wd:Q5 . # all Humans
      ?value wdt:P106/wdt:P279* wd:Q639669 . # of occupation or subclass of occupation is  musician
      ?value wdt:P569 ?born . # Birthdate
      FILTER(?born >= "1981-01-01T00:00:00Z"^^xsd:dateTime) # filter by Birthyear
    }
    ORDER BY ASC(?born)
    #LIMIT 500
  }

  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,de". }
}

This gets me (theoretically) all people whose occupation is musician (https://www.wikidata.org/wiki/Q639669) and who were born after 1900. (Theoretically, because the query runs far too long and I had to break it into smaller chunks; the filter shown above starts at 1981.)

What I am after, however, is to exclude people who are primarily classical musicians. Is there a property I am not aware of? Otherwise, how would I change my query to filter by specific properties (like Q21680663, classical composer)?

Thanks!


Solution

  • If you check the Examples tab in the query interface and type music into the search field, you'll find an example that almost hits the spot: Musicians or singers that have a genre containing 'rock'.

    I've used that mostly to just get a list of all musicians with their genres. I finally settled on a MINUS query subtracting any musician who touches western classical music or baroque music, the latter included specifically to get Bach, the old bastard.

    SELECT DISTINCT 
        ?human ?humanLabel 
        (GROUP_CONCAT(DISTINCT ?genreLabel; SEPARATOR = ", ") AS ?genres) 
    WHERE {
        {
            ?human wdt:P31 wd:Q5;
                   wdt:P106 wd:Q639669;
                   wdt:P136 ?genre.
        } MINUS {
            # Q9730 is western classical music, Q8361 is baroque music
            VALUES ?classics {
                wd:Q9730
                wd:Q8361
            }
            ?human wdt:P136 ?classics.
        }

        # This is just boilerplate to get the labels.
        # It's slightly faster this way than the label
        # service, and the query is close to timing out already.
        ?genre rdfs:label ?genreLabel.
        FILTER(LANG(?genreLabel) = "en")
        ?human rdfs:label ?humanLabel.
        FILTER(LANG(?humanLabel) = "en")
    }
    GROUP BY ?humanLabel ?human
    

    In the Query Interface: 25,000 results in 20sec

    Here's a taste of what the results look like (from some intermediate version, because I'm not redoing the table now).

    artist            genres
    Gigi D'Agostino   Latin jazz, Italo dance
    Erykah Badu       neo soul, soul music
    Yoko Kanno        jazz, blues, pop music, J-pop, film score, New-age music, art rock, ambient music
    Michael Franks    pop music, rock music
    Harry Nilsson     rock music, pop music, soft rock, baroque pop, psychedelic rock, sunshine pop
    Yulia Nachalova   jazz, pop music, soul music, contemporary R&B, blue-eyed soul, estrada
    Linda McCartney   pop rock

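    From the original example, you may also want to try including singers. The following does that: the VALUES block goes in front of the triple pattern, and the last line replaces the existing "P106" line. It returns about twice as many results, but it often times out.

      # goes before the "?human wdt:P31 wd:Q5;" pattern
      VALUES ?professions {
         wd:Q177220   # singer
         wd:Q639669   # musician
      }
      # replaces "wdt:P106 wd:Q639669;"
      wdt:P106 ?professions;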
    

    Query including singers, 53,000 results but may time out

    The example also uses the following to cut down results rather drastically, by including only items with more than a certain number of statements, assuming those correlate with... something. You may want to experiment with it to focus on the most significant results, or to give you room to avoid the timeout with other changes. Trying limits lower than 50 may be a good way to find the right balance, though.

    ?human wikibase:statements ?statementcount.        
    FILTER(?statementcount > 50 )
    

    A query with singers and the statement limit
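
    Assembled from the fragments above, such a query might look roughly like the following. This is a sketch, not the linked query itself: where the VALUES block and the statement filter sit inside the first group is an assumption, and it has not been checked against the timeout, so the threshold of 50 may need adjusting.

    SELECT DISTINCT
        ?human ?humanLabel
        (GROUP_CONCAT(DISTINCT ?genreLabel; SEPARATOR = ", ") AS ?genres)
    WHERE {
        {
            # musicians and singers
            VALUES ?professions {
                wd:Q177220   # singer
                wd:Q639669   # musician
            }
            ?human wdt:P31 wd:Q5;
                   wdt:P106 ?professions;
                   wdt:P136 ?genre;
                   wikibase:statements ?statementcount.
            # keep only items with many statements (placement is an assumption)
            FILTER(?statementcount > 50)
        } MINUS {
            VALUES ?classics {
                wd:Q9730   # western classical music
                wd:Q8361   # baroque music
            }
            ?human wdt:P136 ?classics.
        }

        # labels without the label service, as in the query above
        ?genre rdfs:label ?genreLabel.
        FILTER(LANG(?genreLabel) = "en")
        ?human rdfs:label ?humanLabel.
        FILTER(LANG(?humanLabel) = "en")
    }
    GROUP BY ?humanLabel ?human

    Putting the statement filter inside the first group means the expensive label lookups only run for the items that survive it, which is the point of using it to stay under the timeout.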

    This is an earlier version. It excludes all the listed genres, but includes any musician linked to any other genre, and there are many of them that would probably qualify as "classics". The filter uses the "NOT IN" construct, which seems cleaner to me than filtering based on labels.

    SELECT DISTINCT 
        ?human ?humanLabel 
        (GROUP_CONCAT(DISTINCT ?genreLabel; SEPARATOR = ", ") AS ?genres) 
    WHERE {
       ?human wdt:P31 wd:Q5;
              wdt:P106 wd:Q639669;
              wdt:P136 ?genre.
        
        # The "MAGIC": Q9730 is "Western Classical Music"
        # Q1344 is "opera"
        # Then I noticed Amadeus, Wagner, and Bach all slipped through and expanded the list, and it's a really
        # ugly way of doing this
        FILTER(?genre NOT IN(wd:Q9730, wd:Q1344, wd:Q9734, wd:Q9748, wd:Q189201, wd:Q8361, wd:Q2142754, wd:Q937364, wd:Q1546995, wd:Q1746028, wd:Q207338, wd:Q3328774, wd:Q1065742))
    
        ?genre rdfs:label ?genreLabel.
        FILTER((LANG(?genreLabel)) = "en")
        ?human rdfs:label ?humanLabel.
        FILTER((LANG(?humanLabel)) = "en")
    }
    GROUP BY ?humanLabel ?human
    

    This gets me 26,000 results.

    Note that this will still return artists that have "western classical music" among their genres, as long as they are also linked to other genres. To exclude any musician ever dabbling in the classics, you'll have to use a MINUS construct to, essentially, subtract all those.
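
    A MINUS variant that avoids hand-maintaining the long genre list might borrow the property path the question already uses for occupations. The following is only a sketch: it assumes that the classical subgenres are linked to Q9730 or Q8361 through wdt:P279 (subclass of), which is not guaranteed for every genre item, and it has not been checked against the timeout.

    SELECT DISTINCT
        ?human ?humanLabel
        (GROUP_CONCAT(DISTINCT ?genreLabel; SEPARATOR = ", ") AS ?genres)
    WHERE {
        {
            ?human wdt:P31 wd:Q5;
                   wdt:P106 wd:Q639669;
                   wdt:P136 ?genre.
        } MINUS {
            VALUES ?classics {
                wd:Q9730   # western classical music
                wd:Q8361   # baroque music
            }
            # Assumption: subgenres reach these items via wdt:P279;
            # the * also matches the genre itself (zero steps)
            ?human wdt:P136/wdt:P279* ?classics.
        }

        ?genre rdfs:label ?genreLabel.
        FILTER(LANG(?genreLabel) = "en")
        ?human rdfs:label ?humanLabel.
        FILTER(LANG(?humanLabel) = "en")
    }
    GROUP BY ?humanLabel ?human

    Genres that are not modelled as subclasses of either item would still slip through and would have to be added to the VALUES block by hand.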