
Expanding a Scribunto module that doesn't have a function


I want to get the return value of this Wikimedia Scribunto module in Python. Its source code is roughly like this:

local Languages = {}

Languages = {
    ["aa"] = {
        name = "afarština",
        dir = "ltr",
        name_attr_gen_pl = "afarských"
    },
    -- More languages...
    ["zza"] = {
        name = "zazaki",
        dir = "ltr"
    }
}

return Languages 

In the Wiktextract library, there is already Python code to accomplish similar tasks:

def expand_template(sub_domain: str, text: str) -> str:
    import requests

    # https://www.mediawiki.org/wiki/API:Expandtemplates
    params = {
        "action": "expandtemplates",
        "format": "json",
        "text": text,
        "prop": "wikitext",
        "formatversion": "2",
    }
    r = requests.get(f"https://{sub_domain}.wiktionary.org/w/api.php",
                     params=params)
    data = r.json()
    return data["expandtemplates"]["wikitext"]

This works for languages like French because the Scribunto module there has a well-defined function that returns a value, for example:

Scribunto module:

p = {}

function p.affiche_langues_python(frame)
-- returns the needed stuff here
end

The associated Python function:

import json


def get_fr_languages():
    # https://fr.wiktionary.org/wiki/Module:langues/analyse
    json_text = expand_template(
        "fr", "{{#invoke:langues/analyse|affiche_langues_python}}"
    )
    # Keep only the text from the first "{" to the first "}"
    json_text = json_text[json_text.index("{") : json_text.index("}") + 1]
    json_text = json_text.replace(",\r\n}", "}")  # remove trailing comma
    data = json.loads(json_text)
    lang_data = {}
    for lang_code, lang_name in data.items():
        lang_data[lang_code] = [lang_name[0].upper() + lang_name[1:]]

    save_json_file(lang_data, "fr")

But in our case we don't have a function to call. So if we try:

def get_cs_languages():
    # https://cs.wiktionary.org/wiki/Modul:Languages
    json_text = expand_template(
        "cs", "{{#invoke:Languages}}"
    )
    print(json_text)

we get

<strong class="error"><span class="scribunto-error" id="mw-scribunto-error-0">Chyba skriptu: Musíte uvést funkci, která se má zavolat.</span></strong>
usage: get_languages.py [-h] sub_domain lang_code
get_languages.py: error: the following arguments are required: sub_domain, lang_code

The Czech message translates to "Script error: You have to specify the function you want to call." But when you enter a function name as a parameter, as in the French example, it complains that the function does not exist.

What could be a way to solve this?


Solution

  • The easiest and most general way is to get the return value of the module as JSON and parse it in Python.

    Make another module that exports a function dump_as_json that takes the name of the first module as a frame argument and returns the first module as JSON. In Python, expand {{#invoke:json module|dump_as_json|Module:module to dump}} using the expandtemplates API and parse the return value of the module invocation as JSON with json.loads(data["expandtemplates"]["wikitext"]).

    Text of Module:json module (call it what you want):

    return {
        dump_as_json = function(frame)
            local module_name = frame.args[1]
            local json_encode = mw.text.jsonEncode
            -- json_encode = require "Module:JSON".toJSON
            return json_encode(require(module_name))
        end
    }
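
On the Python side, the call described above could look roughly like the following. This is a minimal sketch, not part of Wiktextract: `build_invoke` and `dump_module` are hypothetical helper names, and the `expand` argument is assumed to be any function that sends wikitext to the expandtemplates API and returns the expanded text, such as a wrapper around the `expand_template` function from the question.

```python
import json
from typing import Any, Callable


def build_invoke(json_module: str, target_module: str) -> str:
    """Build the wikitext that asks the JSON module to dump another module."""
    # Doubled braces in an f-string produce literal "{{" and "}}".
    return f"{{{{#invoke:{json_module}|dump_as_json|{target_module}}}}}"


def dump_module(
    expand: Callable[[str], str], json_module: str, target_module: str
) -> Any:
    """Expand the #invoke call and parse the returned text as JSON."""
    wikitext = expand(build_invoke(json_module, target_module))
    return json.loads(wikitext)


# Hypothetical usage with the question's expand_template (needs network access):
# languages = dump_module(
#     lambda text: expand_template("cs", text),
#     "json module",        # whatever name the JSON module was given
#     "Module:Languages",   # assumed title of the module to dump
# )
```

Passing the expander in as a parameter keeps the JSON handling separate from the HTTP call, so the same helper works with `requests`, pywikibot, or a stub in tests.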
    

    With pywikibot:

    import json

    from pywikibot import Site

    site = Site(code="cs", fam="wiktionary")
    languages = json.loads(site.expand_text("{{#invoke:json module|dump_as_json|Module:module to dump}}"))
    

    If you get the error Lua error: Cannot pass circular reference to PHP, this means that at least one of the tables in Module:module to dump is referenced by another table more than once, for example if the module were

    local t = {}
    return { t, t }
    

    To handle these tables, you will have to get a pure-Lua JSON encoder function to replace mw.text.jsonEncode, like the toJSON function from Module:JSON on English Wiktionary.

    One warning about this method that is not relevant for the module you are trying to get: string values in the JSON will only be accurate if they were NFC-normalized valid UTF-8 with no special ASCII control codes (U+0000-U+001F excluding tab U+0009 and LF U+000A) when they were returned from Module:module to dump. As on a wiki page, the expandtemplates API will replace ASCII control codes and invalid UTF-8 with the U+FFFD character, and will NFC-normalize everything else. That is, "\1\128e" .. mw.ustring.char(0x0301) would be modified to the equivalent of mw.ustring.char(0xFFFD, 0xFFFD, 0x00E9). This doesn't matter in most cases (like if the table contains readable text), but if it did matter, the JSON-encoding module would have to output JSON escapes for non-NFC character sequences and ASCII control codes and find some way to encode invalid UTF-8.
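
To see the effect of that warning on concrete data, the replacement and normalization steps can be imitated locally. The sketch below only approximates the behavior described above and is not MediaWiki's actual sanitization code; the function name `approximate_expansion` is made up for illustration.

```python
import unicodedata

REPLACEMENT = "\ufffd"


def approximate_expansion(raw: bytes) -> str:
    """Imitate the cleanup described above: invalid UTF-8 and control codes
    become U+FFFD, and everything else is NFC-normalized."""
    # Invalid UTF-8 byte sequences are decoded to U+FFFD.
    text = raw.decode("utf-8", errors="replace")
    # Control codes below U+0020 other than tab and LF also become U+FFFD.
    cleaned = "".join(
        REPLACEMENT if ord(ch) < 0x20 and ch not in "\t\n" else ch
        for ch in text
    )
    return unicodedata.normalize("NFC", cleaned)


# The Lua string "\1\128e" followed by U+0301, as raw bytes:
result = approximate_expansion(b"\x01\x80e\xcc\x81")
print([f"U+{ord(ch):04X}" for ch in result])
# prints ['U+FFFD', 'U+FFFD', 'U+00E9']
```

Ordinary readable text, such as the language names in the module from the question, passes through unchanged.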

    If, like the module you are dumping, Module:module to dump is a pure table of literal values with no references to other modules or to Scribunto-only global values, you could also get its raw wikitext with the Revisions API and parse it in Lua on your machine and pass it to Python. I think there is a Python extension that allows you to directly use a Lua state in Python.
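
A rough sketch of that alternative follows. It assumes the standard Revisions API parameters with `formatversion=2`, and it assumes the third-party `lupa` package as the Python extension that embeds a Lua state. The helper names are hypothetical, and the `lupa` part is untested here and only suitable for modules that are pure tables of literals.

```python
from typing import Any, Dict


def revision_params(title: str) -> Dict[str, str]:
    """Query parameters for fetching a page's current raw text."""
    return {
        "action": "query",
        "format": "json",
        "formatversion": "2",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "titles": title,
    }


def extract_source(data: Dict[str, Any]) -> str:
    """Pull the raw module text out of a Revisions API response."""
    page = data["query"]["pages"][0]
    return page["revisions"][0]["slots"]["main"]["content"]


def lua_to_python(value: Any) -> Any:
    """Recursively convert Lua tables to dicts (assumes lupa is installed)."""
    import lupa

    if lupa.lua_type(value) == "table":
        return {key: lua_to_python(item) for key, item in value.items()}
    return value


def run_module(source: str) -> Any:
    """Run the module text in a local Lua state and return its value."""
    from lupa import LuaRuntime

    lua = LuaRuntime(unpack_returned_tuples=True)
    return lua_to_python(lua.execute(source))


# Hypothetical usage (needs network access and lupa):
# import requests
# r = requests.get("https://cs.wiktionary.org/w/api.php",
#                  params=revision_params("Module:Languages"))
# languages = run_module(extract_source(r.json()))
```

Only run module text you trust this way, since the Lua code executes on your own machine.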

    Running a module with dependencies on the local machine is not possible unless you go to the trouble of setting up the full Scribunto environment on your machine, and figuring out a way to download the module dependencies and make them available to the Lua state. I have sort of done this myself, but it isn't necessary for your use case.