Tags: wikipedia, wikidata

How much of Wikidata is organic (user-entered independent of Wikipedia)?


I'm trying to figure out how much of Wikidata's entries are "organic", in the sense of data being entered by humans and independently of Wikipedia.

  1. The Wikidata introduction page says that "Automated bots also enter data into Wikidata." Are there any statistics for how much Wikidata data was entered by bots?

  2. I know that Wikidata is an independent organization from Wikipedia. Are there any statistics on how many Wikidata entries were sourced from Wikipedia? (E.g., a person reads a Wikipedia article, finds a fact that isn't in Wikidata, and then enters the fact into Wikidata using that Wikipedia article as a reference.)

I am familiar with Wikidata's SPARQL API and can look up anything that may be needed to figure out these questions.


Solution

  • When you check the "recent changes" (and deactivate the "only humans" filter), or the history of any specific page/item, bot edits are marked with a little 'b', and the bots' names also end in "...Bot". (A small script that tallies bot vs. non-bot edits via the API is sketched at the end of this answer.)

    If you measure just by the number of statements, bots probably add the majority of the data. If you weight by importance or number of views, humans are probably ahead.

    A group you didn't mention, but that might be significant, is the one "in between": people using either OpenRefine or QuickStatements to semi-manually match ("reconcile") some external dataset and import it. The computational biology community, for example, uses Wikidata as a sort of hub in this way. (A short QuickStatements example is included at the end of this answer.)

    Imports from Wikipedia provide a lot of the structure, because every page gets its own Wikidata item (and only one). But most of the data comes from other public datasets. One way to estimate the Wikipedia-sourced share is to count references that use "imported from Wikimedia project" (P143); see the SPARQL sketch at the end of this answer.

    For reasons beyond my understanding, the relationship between some Wikipedias and Wikidata isn't always smooth. And because each individual project has a lot of freedom in such matters, some have moved away from using Wikidata as their backend for storing structured information and are doing their own thing. When that happens, either someone keeps syncing the data in at least one direction, or it starts to diverge. Most recently, the English Wikipedia decided to use a home-grown method of managing short page descriptions, for example.

    (Edit, to answer a question from the comments:) Quality control of bot data is generally identical to that of other edits, except that bot edits (and similar ones, such as those made with QuickStatements) are tagged as such.

    The recent changes overview draws attention to any change, as does the ability to add items to your personal watchlist. There is also an AI system (the same one as on en.wikipedia.org) that predicts bad-faith and low-quality edits; its predictions are tagged as such, highlighted in the change lists, and available as filters. Related edits by the same user are also combined into "editgroups", and this page shows recent ones.

    Properties also have numerous constraints, such as requiring dates of birth and death to be in the past, requiring the subject of a "citizenship" statement to be a person, and so on. Violations of these constraints are marked with (!) on the item's page, and also collected in various lists. The property "award received", for example, requires the subject to be a person, creative work, organisation, etc.; about 8000 violations are listed here, and clicking on one shows a case where a person is missing the statement "is a: person". (The last sketch at the end of this answer runs these constraint checks for a single item through the API.)
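
To put a rough number on question 1, you can sample the recent changes feed through the MediaWiki API and count how many edits carry the bot flag. This is only an approximation (QuickStatements-style edits by humans are usually not bot-flagged), and the sample size below is an arbitrary illustrative choice, not an official methodology:

```python
import requests

API = "https://www.wikidata.org/w/api.php"

def sample_recent_changes(sample_size=2000):
    """Sample the latest Wikidata edits and split them into bot-flagged
    and other edits, following the API's continuation parameters."""
    params = {
        "action": "query",
        "list": "recentchanges",
        "rctype": "edit|new",
        "rcprop": "user|flags|timestamp",
        "rclimit": 500,       # per-request maximum for normal clients
        "format": "json",
        "formatversion": 2,   # flags come back as booleans
    }
    bot = other = 0
    while bot + other < sample_size:
        data = requests.get(API, params=params).json()
        for change in data["query"]["recentchanges"]:
            if change.get("bot"):
                bot += 1
            else:
                other += 1
        if "continue" not in data:
            break
        params.update(data["continue"])
    return bot, other

bot, other = sample_recent_changes()
print(f"{bot} bot-flagged edits vs {other} other edits "
      f"({bot / (bot + other):.0%} bot) in the sampled window")
```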
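
The "in between" path mentioned above usually ends in a QuickStatements batch: plain tab-separated commands of the form item, property, value, optionally followed by source pairs. The row below is a minimal illustration (Douglas Adams "educated at" St John's College, sourced as "imported from" the English Wikipedia); treat the exact source syntax as my assumption about the V1 format rather than a reference:

```python
# A minimal QuickStatements (V1) batch: one tab-separated command per line,
#   <item> <property> <value> [<source-property> <source-value> ...]
# Example row: Douglas Adams (Q42) "educated at" (P69) St John's College
# (Q691283), sourced via "imported from Wikimedia project" (S143) = enwiki (Q328).
rows = [
    ("Q42", "P69", "Q691283", "S143", "Q328"),
]

def to_quickstatements(rows):
    """Serialize rows into QuickStatements V1 text (one command per line)."""
    return "\n".join("\t".join(fields) for fields in rows)

print(to_quickstatements(rows))
```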
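
For question 2, references that come from a Wikipedia are normally recorded with the reference property "imported from Wikimedia project" (P143), whereas external databases tend to use "stated in" (P248) or "reference URL" (P854). A count over all of Wikidata will time out on the query service, so the sketch below only tallies reference properties over an arbitrary sample class (paintings, Q3305213); both the sample class and its size are assumptions you should adapt:

```python
import requests

WDQS = "https://query.wikidata.org/sparql"

# Tally which reference properties appear on the statements of a small sample
# of items (here: 100 paintings). P143 = "imported from Wikimedia project",
# P248 = "stated in", P854 = "reference URL".
QUERY = """
SELECT ?refProperty (COUNT(?reference) AS ?references) WHERE {
  { SELECT ?item WHERE { ?item wdt:P31 wd:Q3305213 . } LIMIT 100 }
  ?item ?claim ?statement .
  ?statement prov:wasDerivedFrom ?reference .
  ?reference ?refPredicate [] .
  ?refProperty wikibase:reference ?refPredicate .
}
GROUP BY ?refProperty
ORDER BY DESC(?references)
"""

response = requests.get(
    WDQS,
    params={"query": QUERY, "format": "json"},
    # WDQS asks for a descriptive User-Agent; adjust to your own tool name.
    headers={"User-Agent": "organic-share-estimate/0.1 (example script)"},
)
for row in response.json()["results"]["bindings"]:
    print(row["refProperty"]["value"], row["references"]["value"])
```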
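
Finally, the constraint checks that produce the (!) markers can also be run per item through the API's wbcheckconstraints module. The sketch below only tallies the reported statuses; since I don't want to rely on the exact response layout, it walks the JSON generically, and Q42 is just a placeholder item:

```python
import requests

API = "https://www.wikidata.org/w/api.php"

def constraint_report(entity_id):
    """Run the on-wiki property-constraint checks for one item and tally
    the reported statuses (e.g. 'violation' vs. 'compliance')."""
    data = requests.get(API, params={
        "action": "wbcheckconstraints",
        "id": entity_id,
        "format": "json",
        "formatversion": 2,
    }).json()

    tally = {}

    def walk(node):
        # Collect every "status" field, wherever it appears in the response.
        if isinstance(node, dict):
            status = node.get("status")
            if isinstance(status, str):
                tally[status] = tally.get(status, 0) + 1
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for value in node:
                walk(value)

    walk(data.get("wbcheckconstraints", {}))
    return tally

# Statuses are named by the API itself; the exact labels may vary.
print(constraint_report("Q42"))
```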