I was trying to extract all US companies, so I ran the following query:
PREFIX cat: <http://dbpedia.org/resource/Category:>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT DISTINCT ?page ?subcat WHERE {
  ?subcat skos:broader* cat:Companies_of_the_United_States_by_industry .
  ?page dcterms:subject ?subcat .
  ?page rdfs:label ?pageName .
}
This is a snapshot of the results
Amgen and Pfizer are both companies and categories, so I end up collecting everything filed under the Pfizer and Amgen categories (people, products). I found out that these entries belong to the Wikipedia categories Category:Wikipedia_categories_named_after_companies_of_the_United_States or Category:Wikipedia_categories_named_after_pharmaceutical_companies_of_the_United_States. So I tried to filter those categories out like this:
SELECT DISTINCT ?page ?subcat WHERE {
  ?subcat skos:broader* cat:Companies_of_the_United_States_by_industry .
  ?page dcterms:subject ?subcat .
  ?page rdfs:label ?pageName .
  FILTER( !regex(?subcat,"Wikipedia_categories_named_after_pharmaceutical_companies_of_the_United_States"))
}
But no luck, they are still there. Any idea how to avoid this problem?
The problem doesn't have anything to do with them having the same name. Wikipedia categories don't form a type hierarchy, so it doesn't make sense to treat them like one. The reason you see these results is that there's a category Pfizer whose skos:broader values include the company listings, but which is also the dcterms:subject of dbpedia:Alprazolam, dbpedia:Cetirizine, etc. That doesn't make sense as a type hierarchy, but it is fine for organizing article topics. If you only want companies back, just ask for things that are companies:
SELECT DISTINCT ?page ?subcat WHERE {
?subcat skos:broader* category:Companies_of_the_United_States_by_industry .
?page dcterms:subject ?subcat .
?page rdfs:label ?pageName.
?page a dbpedia-owl:Company
}
We can clean that up a bit, though. You're not using ?pageName, so we can remove it. We can use some of the shorter syntaxes to make things a little bit cleaner. We can also note that "Companies … by industry" has a skos:broader value "Companies of the United States", which makes the intent of the query a bit clearer.
select distinct ?company ?subcategory where {
?company dcterms:subject ?subcategory ;
a dbpedia-owl:Company .
?subcategory skos:broader* category:Companies_of_the_United_States .
}
limit 1000
As a final note, the category hierarchy doesn't necessarily mean that each company has a single path to the top category. That is, you could get some company listed multiple times, e.g.:
company subcategory
------------------------------------
companyX Textile_Companies
companyX Companies_in_New_Hampshire
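To see how often this happens, one option is to count the matching subcategories per company instead of listing them. This is only a sketch: the ?subcategoryCount variable is a name made up for illustration, the prefix IRIs are written out on the assumption that they match the ones the DBpedia endpoint uses, and an aggregate over skos:broader* may be slow or time out on a public endpoint.

```sparql
PREFIX category: <http://dbpedia.org/resource/Category:>
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

# One row per company, with the number of distinct subcategories
# through which it reaches the top category.
select ?company (count(distinct ?subcategory) as ?subcategoryCount) where {
  ?company dcterms:subject ?subcategory ;
           a dbpedia-owl:Company .
  ?subcategory skos:broader* category:Companies_of_the_United_States .
}
group by ?company
order by desc(?subcategoryCount)
limit 1000
```

Any company with a count above 1 would appear on more than one row in the ungrouped query.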
Unless you need the listing of subcategories, you might consider eliminating it from the query, in which case you can simply have (using property paths):
select distinct ?company where {
?company a dbpedia-owl:Company ;
dcterms:subject/skos:broader* category:Companies_of_the_United_States .
}
limit 1000
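The queries in this answer appear to rely on prefixes that the public DBpedia endpoint predefines (category:, dbpedia-owl:, dcterms:, skos:). If you run them somewhere that doesn't predefine them, a self-contained version of the last query might look like the following. The namespace IRIs here are assumed to be the ones DBpedia uses.

```sparql
PREFIX category: <http://dbpedia.org/resource/Category:>
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

# Companies whose subject category is, or sits somewhere below,
# Companies_of_the_United_States.
select distinct ?company where {
  ?company a dbpedia-owl:Company ;
           dcterms:subject/skos:broader* category:Companies_of_the_United_States .
}
limit 1000
```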