Records generate large size compiled files


I've created a record that contains several different types of categories, but this causes GF to generate huge .gfo files (150 MB), or even to break down for some languages after compiling for more than 10 minutes. My record contains only 9 elements (8 different categories).

Take the following small example: let's say I want to create a self-introduction text using GF. To generate such a text, I created a record named Person. The Person record contains information such as name, age, address, hobbies, human traits, etc. Just creating this small record made GF generate a 1.28 MB .gfo file, and it takes more than one second to compile.

Here's my code

Abstract: test.gf

abstract Test = {
    flags
        startcat = Sentence;
    cat
        Sentence; Human;

    fun
        MySentence : Human -> Sentence; 
        Joan : Human;   
}

Concrete: TestSpa.gf

concrete TestSpa of Test = open SyntaxSpa, ParadigmsSpa, Predef, DictSpa in{
    lincat
        Sentence = Text;
        Human = Person;
    lin
        MySentence person =  generateIntro person;


        Joan = {name = "Joan" ; det = aSg_Det ; job = mkN ("Doctor") ; age = mkNumeral("22") ; hobby = bailar_V ; lastMeal = mkV2(hablar_V) ; 
                food = mkN ("spaghetti") ; prep = mkPrep ("at") ; location = mkN("restaurant") ; meal = Lunch ; feeling = mkA("happy")};



    oper
        Person : Type = {name : Str ;
                            det : Det;
                            job : N ;
                            age : Numeral ;
                            hobby : V ;
                            lastMeal : V2 ;
                            food : N;
                            prep : Prep;
                            location : N;
                            meal : Meal;
                            feeling : A};

        generateIntro : Person -> Text =
            \per -> mkText (mkUtt (mkNP (mkDet(i_Pron)) (per.job)));

    param 
        Meal = Breakfast | Lunch | Dinner;

}

I noticed that adding certain categories together in one record makes the running time much longer, while other categories do not affect it as much.

Question:

1- Is there a way, other than using records, to keep values of different categories together, without using dependent types in the abstract syntax?

2- Is there a way to solve this problem and make GF run fast, given that the values in the record are all needed but will not appear in every sentence?


Solution

  • Yes, the second version is smaller, because the lincats have fewer fields each. The original was:

    Person : Type = {name : Str ;
                     det : Det;
                     job : N ;
                     age : Numeral ;
                     hobby : V ;
                     lastMeal : V2 ;
                     food : N;
                     prep : Prep;
                     location : N;
                     meal : Meal;
                     feeling : A};
    

    In your new version, this very large lincat is broken up into several smaller ones. That's why the PGF is smaller.

    Concrete categories

    First, a little background. I have a full blog post on the topic here, which I recommend reading if you are interested. The short version is:

    • Every GF category compiles into potentially multiple concrete categories. The number of concrete categories depends on the number of inherent parameters. For instance:
    param
      Bool = True | False ;
    lincat
      MyCat = {s : Str ; b : Bool} ;
    

    The category MyCat becomes 2 concrete categories in the PGF: one where the b field is True, and another where it's False.
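    To make the counting rule concrete, here is a minimal Python sketch that enumerates the combinations of inherent parameter values for a lincat. It is only a model of the rule described above, not how the GF compiler is implemented; the helper name concrete_categories and the dictionary layout are made up for illustration.

```python
from itertools import product

# Inherent parameters of the lincat MyCat = {s : Str ; b : Bool}.
# Str fields carry no parameters, so only `b` is listed.
mycat_params = {"b": ["True", "False"]}


def concrete_categories(params):
    """Return one dict per combination of inherent parameter values.

    Hypothetical helper: it models the counting rule only.
    """
    names = list(params)
    value_lists = [params[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


cats = concrete_categories(mycat_params)
for cat in cats:
    print(cat)
print(len(cats), "concrete categories")
```

    Adding a second parameter field multiplies the count: a lincat with both a Bool and a three-valued Meal parameter would give 2 * 3 = 6 combinations.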

    Now look at the concrete categories of the RGL categories you use in your lincat of Person. The name field is just a Str, so that doesn't contribute. But next is det : Det. In the Romance languages in the RGL, determiners have inherent parameters for number and for whether they are negative: see the implementation here. Each of those has two values, so we get 2 * 2 = 4 for Det. Next we have N: Romance languages have inherent gender for nouns, so N compiles to 2 concrete categories. Let me just annotate the rest with the numbers:

    Person : Type = {name : Str ;    -- 1
                     det : Det;      -- 4
                     job : N ;       -- 2
                     age : Numeral ; -- 2
                     hobby : V ;     -- 2
                     lastMeal : V2 ; -- 16
                     food : N;       -- 2
                     prep : Prep;    -- 8
                     location : N;   -- 2
                     meal : Meal;    -- 3
                     feeling : A};   -- 8
    

    Here we get the multiplication 4 * 2 * 2 * 2 * 16 * 2 * 8 * 2 * 3 * 8, which equals 393,216 concrete categories for Person.

    In contrast, your four new types are much smaller:

    • Person1 : 1 * 4 * 2 = 8
    • Person2 : 2 * 2 * 16 = 64
    • Person3 : 2 * 8 * 2 = 32
    • Person4 : 3 * 8 = 24

    So altogether 8 + 64 + 32 + 24 = 128 concrete categories for the different Persons.
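    The arithmetic above can be checked with a short Python sketch. The per-field counts are copied from the annotated lincat; the assignment of fields to Person1..Person4 is an assumption inferred from the factors in each product, since the split version itself is not shown here.

```python
from math import prod

# Concrete-category counts per field, copied from the annotated lincat.
person_fields = {
    "name": 1, "det": 4, "job": 2, "age": 2, "hobby": 2, "lastMeal": 16,
    "food": 2, "prep": 8, "location": 2, "meal": 3, "feeling": 8,
}

# Assumed grouping of the fields into the four smaller records.
split = {
    "Person1": ["name", "det", "job"],
    "Person2": ["age", "hobby", "lastMeal"],
    "Person3": ["food", "prep", "location"],
    "Person4": ["meal", "feeling"],
}

# One big record: the counts multiply.
monolithic = prod(person_fields.values())

# Several small records: counts multiply within a record, but add up across records.
per_record = {
    name: prod(person_fields[field] for field in fields)
    for name, fields in split.items()
}
split_total = sum(per_record.values())

print(monolithic)   # 393216
print(per_record)   # {'Person1': 8, 'Person2': 64, 'Person3': 32, 'Person4': 24}
print(split_total)  # 128
```

    The point of the comparison is that fields in the same record multiply, while separate records only add.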

    My suggestion

    If I were writing this grammar, I would just let the grammar generate all combinations, and restrict them in an external programming language.

    The category of Person would just include the name, and all other things, like job, hobby and meal would be separate cats and funs. For example:

    cat
      Person ; Hobby ; Meal ; Sentence ;
    fun
      Joan, Maria : Person ;
      Singing, Dancing : Hobby ;
      Pizza, Dumplings : Meal ;
    
      hasHobby : Person -> Hobby -> Sentence ;
      favFood : Person -> Meal -> Sentence ;
    

    This grammar would construct all combinations of persons, hobbies and meals. Maybe in real life only Joan sings and Maria dances, but the grammar also generates "Joan likes dancing" and "Maria likes singing".

    Benefits to this design are:

    • Each lincat is small, usually just a single RGL category, or maybe 1-2 custom params if needed. (E.g. stuff like Meal is not an RGL param, it's custom created for this application.)
    • Each function takes only a few arguments, so there are only a few concrete functions in the PGF.

    This grammar generates more sentences than the restrictive versions in Yousef's posts, but the PGF is much smaller. That's just how GF works.

    So then in my external application, I would control which GF trees are generated and linearised. I may have a spreadsheet where I store people's favourite foods and hobbies, and from that spreadsheet I can just generate GF trees. For example:

    # people.csv
    Person,favFood,hasHobby
    "Maria","dumplings","dancing"
    "Joan","pizza","singing"
    

    Then write a script to generate GF trees:

    favFood Maria Dumplings
    hasHobby Maria Dancing
    favFood Joan Pizza
    hasHobby Joan Singing
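
    Such a script could look like the following minimal Python sketch. It assumes the CSV columns shown above (without the "# people.csv" comment line) and assumes that each cell maps to a GF function name by capitalising it, e.g. "dumplings" becomes Dumplings; the helper names to_fun and rows_to_trees are made up for illustration. The CSV text is inlined so the example is self-contained.

```python
import csv
import io

# Inlined copy of the spreadsheet; in a real application this would be read from a file.
CSV_TEXT = '''Person,favFood,hasHobby
"Maria","dumplings","dancing"
"Joan","pizza","singing"
'''


def to_fun(value):
    """Map a CSV cell to a GF function name (assumed convention: capitalise it)."""
    return value.strip().capitalize()


def rows_to_trees(rows):
    """Build one favFood tree and one hasHobby tree per spreadsheet row."""
    trees = []
    for row in rows:
        person = to_fun(row["Person"])
        trees.append(f"favFood {person} {to_fun(row['favFood'])}")
        trees.append(f"hasHobby {person} {to_fun(row['hasHobby'])}")
    return trees


trees = rows_to_trees(csv.DictReader(io.StringIO(CSV_TEXT)))
for tree in trees:
    print(tree)
```

    The printed trees can then be passed to GF for linearisation, for example with the linearize command in the GF shell. Only the combinations that occur in the spreadsheet are ever linearised, even though the grammar itself allows all of them.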