angular typescript wikipedia-api

Fetch specific Wikipedia list


How can I fetch these records from Wikipedia as easily as possible? I need a JSON file containing the displayed names for each of the language categories listed here: https://en.wikipedia.org/wiki/Category:Surnames_by_language

Example

[
 {
  "name": "Agalliu",
  "language": "Albanian"
 },
 {
  "name": "Agolli",
  "language": "Albanian"
 }
 ...
]

I'm working with Angular 5.

Also: Is it legal for me to create a database with the information that the data is from Wikipedia?


Solution

  • I don't work with Angular 5 or TypeScript, so I can't give you the specific code you need, but I think you should have a look at the HttpClient documentation. This search on GitHub might help you find a module that has already been developed. Angular seems very well documented, which is nice. So my answer is more theoretical than technical.

    About the data you want in the JSON file (the surname and the language of that surname): if you only want to work with the pages in the category, I think the best way might be to take the surname from the title of each page and the language from the title of the subcategory that contains it. If you want to do it this way:

    • You will need to check and clean the category titles too. E.g. Irish-language feminine surnames and Irish-language masculine surnames should both be cleaned to Irish. It would be good to add another JSON value that keeps the original category title, because it will help you recover the URL later.
    • You will need to check whether the page title of each surname needs cleaning, because otherwise you will probably get values like Hoti (surname). As with the category title, I recommend adding another JSON value that keeps the original page title in case you need it later.
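
    The two cleaning steps above could be sketched roughly as follows in TypeScript. This is only an illustration: the function names and the extra fields pageTitle and categoryTitle are hypothetical, and the regular expressions assume that category titles look like "Category:X-language surnames" and that page titles carry at most one trailing parenthetical, which will not hold for every page.

```typescript
// Hypothetical record shape: `name` and `language` come from the question;
// `pageTitle` and `categoryTitle` keep the raw titles so the URLs can be rebuilt.
interface SurnameRecord {
  name: string;
  language: string;
  pageTitle: string;
  categoryTitle: string;
}

// Assumed pattern: "Category:Irish-language feminine surnames" -> "Irish".
function languageFromCategory(categoryTitle: string): string {
  return categoryTitle
    .replace(/[\u200e\u200f]/g, "") // drop invisible direction marks
    .trim()
    .replace(/^Category:/, "")
    .replace(/\s+surnames$/i, "")
    .replace(/\s+(feminine|masculine)$/i, "")
    .replace(/-language$/i, "")
    .trim();
}

// Assumed pattern: "Hoti (surname)" -> "Hoti"; titles without a
// trailing parenthetical are returned unchanged.
function nameFromPageTitle(pageTitle: string): string {
  return pageTitle.replace(/\s*\([^)]*\)\s*$/, "").trim();
}

// Combine both cleaned values with the untouched originals.
function toRecord(pageTitle: string, categoryTitle: string): SurnameRecord {
  return {
    name: nameFromPageTitle(pageTitle),
    language: languageFromCategory(categoryTitle),
    pageTitle: pageTitle,
    categoryTitle: categoryTitle,
  };
}

const example = toRecord("Hoti (surname)", "Category:Albanian-language surnames");
console.log(JSON.stringify(example, null, 1));
```

    Keeping the raw titles next to the cleaned values means a wrong guess by the regular expressions can be corrected later without fetching the pages again.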

    I think another good way to do it is to query Wikidata. The Wikipedia pages have very different structures, and there is no infobox common to all of them that would let you scrape a specific field (language or whatever it may be). However, extracting the data from Wikidata rather than from the category has disadvantages too:

    • If you only want to work with the surnames/pages in the category you mentioned (Surnames by language), working with Wikidata isn't an option, because Wikidata probably has a larger data set and you will get more surnames than there are in the category.
    • It is likely that many of the surname items don't have a specific language. An item may lack the property native label (P1705), or it may have the property but with the value surname (multiple languages).
    • And of course, it might have a steeper learning curve, because you will probably need to learn SPARQL and the Wikidata Query Service.
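
    To give an idea of what the Wikidata route involves, here is a minimal TypeScript sketch that builds a request for the Wikidata Query Service and reads the standard SPARQL JSON result. It assumes surnames are modelled as instances (P31) of family name (Q101352) and only returns items that have a native label (P1705); the function and type names are hypothetical, and the language comes back as a code such as "sq", not as a name such as "Albanian".

```typescript
// Public endpoint of the Wikidata Query Service.
const WDQS_ENDPOINT = "https://query.wikidata.org/sparql";

// Assumption: P31 = instance of, Q101352 = family name, P1705 = native label.
function buildSurnameQuery(limit: number): string {
  return [
    "SELECT ?item ?nativeLabel (LANG(?nativeLabel) AS ?lang) WHERE {",
    "  ?item wdt:P31 wd:Q101352 ;",
    "        wdt:P1705 ?nativeLabel .",
    "}",
    `LIMIT ${limit}`,
  ].join("\n");
}

// URL that can be fetched with any HTTP client.
function buildQueryUrl(limit: number): string {
  return `${WDQS_ENDPOINT}?format=json&query=${encodeURIComponent(buildSurnameQuery(limit))}`;
}

// Minimal types for the SPARQL JSON results format.
interface SparqlValue { type: string; value: string; }
interface SurnameBinding { item: SparqlValue; nativeLabel: SparqlValue; lang?: SparqlValue; }
interface SparqlResponse { results: { bindings: SurnameBinding[] }; }

// Hypothetical output shape for this sketch.
interface WikidataSurname { name: string; languageCode: string; item: string; }

function parseBindings(response: SparqlResponse): WikidataSurname[] {
  return response.results.bindings.map((b) => ({
    name: b.nativeLabel.value,
    languageCode: b.lang ? b.lang.value : "",
    item: b.item.value,
  }));
}

console.log(buildQueryUrl(100));
```

    The printed URL can be opened in a browser or passed to an HTTP client, and the decoded JSON response handed to parseBindings. Items without a native label are simply missing from the result, which is the second disadvantage listed above.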

    Take a look at the MediaWiki API and Wikidata:Data Access.
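
    As a starting point for the MediaWiki API, the following TypeScript sketch lists the members of a category with list=categorymembers and follows the cmcontinue token until all members have been read. The helper names are hypothetical, and the HTTP call is passed in as a getJson function (an assumption of this sketch), so it can be backed by Angular's HttpClient, fetch, or anything else that returns the decoded JSON.

```typescript
const API = "https://en.wikipedia.org/w/api.php";

// Build the request URL for one batch of category members.
function categoryMembersUrl(categoryTitle: string, cmcontinue?: string): string {
  const params: { [key: string]: string } = {
    action: "query",
    list: "categorymembers",
    cmtitle: categoryTitle,
    cmtype: "page|subcat",
    cmlimit: "500",
    format: "json",
    origin: "*", // needed for anonymous cross-origin requests from a browser
  };
  if (cmcontinue) {
    params["cmcontinue"] = cmcontinue;
  }
  const query = Object.keys(params)
    .map((key) => `${encodeURIComponent(key)}=${encodeURIComponent(params[key])}`)
    .join("&");
  return `${API}?${query}`;
}

// Only the fields of the response that this sketch uses.
interface CategoryMember { pageid: number; ns: number; title: string; }
interface CategoryMembersResponse {
  continue?: { cmcontinue: string };
  query?: { categorymembers: CategoryMember[] };
}

// Namespace 0 holds articles, namespace 14 holds categories.
function splitMembers(members: CategoryMember[]): { pages: string[]; subcategories: string[] } {
  return {
    pages: members.filter((m) => m.ns === 0).map((m) => m.title),
    subcategories: members.filter((m) => m.ns === 14).map((m) => m.title),
  };
}

// Read every batch; `getJson` is supplied by the caller.
async function fetchAllMembers(
  categoryTitle: string,
  getJson: (url: string) => Promise<CategoryMembersResponse>
): Promise<CategoryMember[]> {
  const all: CategoryMember[] = [];
  let cont: string | undefined = undefined;
  do {
    const data: CategoryMembersResponse = await getJson(categoryMembersUrl(categoryTitle, cont));
    if (data.query) {
      all.push(...data.query.categorymembers);
    }
    cont = data.continue ? data.continue.cmcontinue : undefined;
  } while (cont);
  return all;
}

console.log(categoryMembersUrl("Category:Surnames by language"));
```

    Calling fetchAllMembers on Category:Surnames by language and then again on each returned subcategory would give the page titles per language category. Some subcategories may contain further subcategories, so a recursive walk with a depth limit is probably safer than assuming a flat structure.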

    "Is it legal for me to create a database with the information that the data is from Wikipedia?"

    Yes, it is perfectly legal. What you have to do is respect the license. The English Wikipedia is licensed under Creative Commons Attribution-ShareAlike 3.0 Unported. This license allows you to reuse and modify the content, commercially or non-commercially, but you must attribute the authorship and share any derivatives under the same license.

    In the case of Wikidata, everything in the item and property namespaces (Q:* and P:*) is in the public domain and marked as CC0, a Creative Commons tool for indicating that a work is in the public domain. What can you do with the data? Whatever you want.

    I recommend reading the Creative Commons FAQ about CC0 and the legal code of Creative Commons Attribution-ShareAlike 3.0 Unported.