python mediawiki pywikibot

How to identify Wikipedia categories in Python


I am currently using pywikibot to obtain the categories of a given Wikipedia page (e.g., support-vector machine) as follows.

import pywikibot as pw

page = pw.Page(pw.Site('en', 'wikipedia'), 'support-vector machine')
print([cat.title() for cat in page.categories()])

The result I get is:

[
  'Category:All articles with specifically marked weasel-worded phrases',
  'Category:All articles with unsourced statements',
  'Category:Articles with specifically marked weasel-worded phrases from May 2018',
  'Category:Articles with unsourced statements from June 2013',
  'Category:Articles with unsourced statements from March 2017',
  'Category:Articles with unsourced statements from March 2018',
  'Category:CS1 maint: Uses editors parameter',
  'Category:Classification algorithms',
  'Category:Statistical classification',
  'Category:Support vector machines',
  'Category:Wikipedia articles needing clarification from November 2017',
  'Category:Wikipedia articles with BNF identifiers',
  'Category:Wikipedia articles with GND identifiers',
  'Category:Wikipedia articles with LCCN identifiers'
]

As you can see, the results include a lot of Wikipedia's tracking and maintenance categories, such as:

  • Category:All articles with specifically marked weasel-worded phrases
  • Category:All articles with unsourced statements
  • Category:CS1 maint: Uses editors parameter
  • etc.

However, the only categories I am interested in are:

  • Category:Classification algorithms
  • Category:Statistical classification
  • Category:Support vector machines

I am wondering if there is a way to get all of Wikipedia's tracking or maintenance categories, so that I can remove them from the results and keep only the informative categories.

Alternatively, please suggest any other ways of eliminating them from the results.

I am happy to provide more details if needed.


Solution

  • pywikibot does not currently expose an API option for filtering out hidden categories, which is how Wikipedia generally marks its tracking and maintenance categories. You can filter them manually by checking for the hidden key in each category's categoryinfo:

    import pywikibot as pw
    
    site = pw.Site('en', 'wikipedia')
    print([
        cat.title()
        for cat in pw.Page(site, 'support-vector machine').categories()
        if 'hidden' not in cat.categoryinfo
    ])
    

    gives:

    ['Category:Classification algorithms', 
     'Category:Statistical classification', 
     'Category:Support vector machines']
    

    See https://www.mediawiki.org/wiki/Help:Categories#Hidden_categories and https://en.wikipedia.org/wiki/Wikipedia:Categorization#Hiding_categories for more info.
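
If you would rather not look up categoryinfo for every category, another option is to query the MediaWiki API directly and let the server drop hidden categories with clshow=!hidden. The following is only a minimal sketch using the standard library: the function names (build_params, extract_titles, visible_categories) and the User-Agent string are hypothetical, and it assumes a single page with few enough categories that no API continuation is needed.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"


def build_params(title):
    """Build query parameters asking for the non-hidden categories of one page."""
    return {
        "action": "query",
        "prop": "categories",
        "titles": title,
        "clshow": "!hidden",   # ask the server to exclude hidden categories
        "cllimit": "max",
        "redirects": "1",      # follow redirects to the target article
        "format": "json",
        "formatversion": "2",  # pages are returned as a list, not a dict
    }


def extract_titles(response):
    """Collect category titles from a formatversion=2 API response."""
    pages = response.get("query", {}).get("pages", [])
    return [
        cat["title"]
        for page in pages
        for cat in page.get("categories", [])
    ]


def visible_categories(title):
    """Fetch the non-hidden categories of a page (performs a network request)."""
    url = API_URL + "?" + urlencode(build_params(title))
    # Hypothetical User-Agent; replace it with one that identifies your tool.
    request = Request(url, headers={"User-Agent": "category-example/0.1"})
    with urlopen(request) as resp:
        return extract_titles(json.load(resp))
```

Calling visible_categories('Support-vector machine') should then return only the non-hidden categories in a single request. Pages with a very large number of categories would need continuation handling, which this sketch leaves out.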