sparql wikipedia dbpedia

How to handle Wikipedia Named Entities that have the same Category name


I was trying to extract all US companies, so I ran the following query:

PREFIX cat: <http://dbpedia.org/resource/Category:>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT DISTINCT ?page ?subcat WHERE {
  ?subcat skos:broader* cat:Companies_of_the_United_States_by_industry .
  ?page dcterms:subject ?subcat .
  ?page rdfs:label ?pageName .
}

This is a snapshot of the results: [image]

Amgen and Pfizer are both companies and categories, so I end up collecting everything under Pfizer and Amgen (people, products). I found out that these entries belong to a Wikipedia category called Category:Wikipedia_categories_named_after_companies_of_the_United_States or Category:Wikipedia_categories_named_after_pharmaceutical_companies_of_the_United_States, so I tried to filter out these categories like this:

SELECT DISTINCT ?page ?subcat  WHERE { ?subcat skos:broader* cat:Companies_of_the_United_States_by_industry . 
?page dcterms:subject ?subcat . 
?page  rdfs:label ?pageName. 
FILTER( !regex(?subcat,"Wikipedia_categories_named_after_pharmaceutical_companies_of_the_United_States")) }

But no luck, they are still there. Any idea how to avoid this problem?


Solution

  • The problem doesn't have anything to do with them having the same name. Wikipedia categories don't form a type hierarchy, so it doesn't make sense to treat them like one. You see these results because there is a category Pfizer whose skos:broader values include the company listings, but which is also the dcterms:subject of dbpedia:Alprazolam, dbpedia:Cetirizine, etc. That doesn't make sense as a type hierarchy, but it is fine for organizing article topics. If you only want companies back, just ask for things that are companies:

    SELECT DISTINCT ?page ?subcat  WHERE {
      ?subcat skos:broader* category:Companies_of_the_United_States_by_industry . 
      ?page dcterms:subject ?subcat . 
      ?page rdfs:label ?pageName. 
      ?page a dbpedia-owl:Company
    }
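
    To see why a non-company page shows up in the original results, it can help to ask which of its categories reach the company category. The following is only a minimal sketch: it assumes dbpedia:Alprazolam is still filed under a category that descends from the company listings, and it declares the prefixes explicitly rather than relying on whatever the endpoint predefines.

    ```sparql
    PREFIX category: <http://dbpedia.org/resource/Category:>
    PREFIX dbpedia: <http://dbpedia.org/resource/>
    PREFIX dcterms: <http://purl.org/dc/terms/>
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

    # List the categories of one page that lead, through zero or more
    # skos:broader steps, to the top-level company category.
    SELECT DISTINCT ?subcat WHERE {
      dbpedia:Alprazolam dcterms:subject ?subcat .
      ?subcat skos:broader* category:Companies_of_the_United_States_by_industry .
    }
    ```

    Any category returned here is a topical link to the company listings, not evidence that the page itself is a company, which is why the type check on ?page is needed.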
    

    We can clean that up a bit, though. You're not using ?pageName, so we can remove it. We can use some of the shorter syntaxes to make things a little cleaner. We can also note that "Companies … by industry" has a skos:broader value "Companies of the United States", which makes the intent of the query a bit clearer.

    select distinct ?company ?subcategory  where {
      ?company dcterms:subject ?subcategory ;
               a dbpedia-owl:Company .
      ?subcategory skos:broader* category:Companies_of_the_United_States . 
    }
    limit 1000
    


    As a final note, the category hierarchy doesn't necessarily mean that each company has a single path to the top category. That is, you could get some company listed multiple times, e.g.:

    company   subcategory
    ------------------------------------
    companyX  Textile_Companies
    companyX  Companies_in_New_Hampshire
    

    Unless you need the listing of subcategories, you might consider eliminating it from the query, in which case you can simply have (using property paths):

    select distinct ?company where {
      ?company a dbpedia-owl:Company ;
               dcterms:subject/skos:broader* category:Companies_of_the_United_States .
    }
    limit 1000
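
    If the subcategories are needed but one row per company is preferred, the matching subcategories could be aggregated instead of dropped. This is a sketch under the same assumptions as the queries above (the category and ontology IRIs are those of DBpedia; the ?subcategories variable name is only illustrative), with prefixes declared explicitly:

    ```sparql
    PREFIX category: <http://dbpedia.org/resource/Category:>
    PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
    PREFIX dcterms: <http://purl.org/dc/terms/>
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

    # One row per company; every subcategory that reaches the top-level
    # category is concatenated into a single comma-separated string.
    SELECT ?company
           (GROUP_CONCAT(DISTINCT STR(?subcategory); SEPARATOR=", ") AS ?subcategories)
    WHERE {
      ?company a dbpedia-owl:Company ;
               dcterms:subject ?subcategory .
      ?subcategory skos:broader* category:Companies_of_the_United_States .
    }
    GROUP BY ?company
    LIMIT 1000
    ```

    Grouping trades the separate (company, subcategory) rows for a single string per company, so it suits display better than further joins.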
    
