Tags: python, regex, parsing, configparser, moses

Parsing a Moses config file


Given the following config file from the Moses Machine Translation Toolkit:

#########################
### MOSES CONFIG FILE ###
#########################

# input factors
[input-factors]
0

# mapping steps
[mapping]
0 T 0

[distortion-limit]
6

# feature functions
[feature]
UnknownWordPenalty
WordPenalty
PhrasePenalty
PhraseDictionaryMemory name=TranslationModel0 num-features=4 path=/home/gillin/jojomert/phrase-jojo/work.src-ref/training/model/phrase-table.gz input-factor=0 output-factor=0
LexicalReordering name=LexicalReordering0 num-features=6 type=wbe-msd-bidirectional-fe-allff input-factor=0 output-factor=0 path=/home/gillin/jojomert/phrase-jojo/work.src-ref/training/model/reordering-table.wbe-msd-bidirectional-fe.gz
Distortion
KENLM lazyken=0 name=LM0 factor=0 path=/home/gillin/jojomert/ru.kenlm order=5

# dense weights for feature functions
[weight]
UnknownWordPenalty0= 1
WordPenalty0= -1
PhrasePenalty0= 0.2
TranslationModel0= 0.2 0.2 0.2 0.2
LexicalReordering0= 0.3 0.3 0.3 0.3 0.3 0.3
Distortion0= 0.3
LM0= 0.5

I need to read the parameters from the [weight] section:

UnknownWordPenalty0= 1
WordPenalty0= -1
PhrasePenalty0= 0.2
TranslationModel0= 0.2 0.2 0.2 0.2
LexicalReordering0= 0.3 0.3 0.3 0.3 0.3 0.3
Distortion0= 0.3
LM0= 0.5

I have been doing it as such:

def read_params_from_moses_ini(mosesinifile):
    parameters_string = ""
    with open(mosesinifile, 'r') as fin:
        # Walk the file backwards until the [weight] header is reached.
        for line in reversed(fin.readlines()):
            if line.startswith('[weight]'):
                return parameters_string
            parameters_string += line.strip() + ' '

to get this output:

LM0= 0.5 Distortion0= 0.3 LexicalReordering0= 0.3 0.3 0.3 0.3 0.3 0.3 TranslationModel0= 0.2 0.2 0.2 0.2 PhrasePenalty0= 0.2 WordPenalty0= -1 UnknownWordPenalty0= 1 

Then I parse the output with:

import re

moses_param_pattern = re.compile(r'''([^\s=]+)=\s*((?:[^\s=]+(?:\s|$))*)''')

def parse_parameters(parameters_string):
    # Map each key to the list of floats that follows its '=' sign.
    return dict((k, list(map(float, v.split())))
                for k, v in moses_param_pattern.findall(parameters_string))


mosesinifile = 'mertfiles/moses.ini'

print(parse_parameters(read_params_from_moses_ini(mosesinifile)))

to get:

{'UnknownWordPenalty0': [1.0], 'PhrasePenalty0': [0.2], 'WordPenalty0': [-1.0], 'Distortion0': [0.3], 'LexicalReordering0': [0.3, 0.3, 0.3, 0.3, 0.3, 0.3], 'TranslationModel0': [0.2, 0.2, 0.2, 0.2], 'LM0': [0.5]}

The current solution involves a crazy reversed line-by-line read of the config file, followed by a fairly complicated regex to extract the parameters.

Is there a simpler or less hacky/verbose way to read the file and achieve the desired parameter dictionary output?

Is it possible to adapt configparser so that it reads the Moses config file? That looks hard because some of the sections are really parameters, e.g. [distortion-limit], whose value 6 has no key. In a file that configparser accepts, it would have been distortion-limit = 6.


Note: Python's built-in configparser is unable to handle a moses.ini config file. Answers from "How to read and write INI file with Python3?" will not work.
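
To illustrate the problem described above, here is a minimal sketch of what happens when a default ConfigParser meets a bare value such as the 6 under [distortion-limit]. The excerpt string and the helper name default_configparser_accepts are made up for this illustration; only the standard-library configparser calls are real.

```python
import configparser

# Shortened, hypothetical excerpt of a moses.ini file (illustration only).
MOSES_INI_EXCERPT = """\
[distortion-limit]
6

[weight]
LM0= 0.5
"""

def default_configparser_accepts(text):
    """Return True if a default ConfigParser can parse the text."""
    parser = configparser.ConfigParser()
    try:
        parser.read_string(text)
    except configparser.ParsingError:
        # Raised for lines such as the bare "6", which has no key or delimiter.
        return False
    return True

accepted = default_configparser_accepts(MOSES_INI_EXCERPT)
print(accepted)
```

With default settings, the excerpt is rejected, whereas the same section written as distortion-limit = 6 would be accepted.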


Solution

  • Here is another short regex-based solution that returns a dictionary of the values similar to your output:

    import re
    
    dct = {}
    
    # Placeholder: replace with the full text of the moses.ini file
    text = "MOSES_INI_FILE_CONTENTS"
    
    # Get the [weight] section
    match_weight = re.search(r"\[weight][^\n]*(?:\n(?!$|\n)[^\n]*)*", text)  # Similar in effect to "(?s)\[weight].*?(?:$|\n\n)"
    if match_weight:
        weight = match_weight.group()  # the [weight] block as text
        # Map each key to the list of floats that follows its '=' sign
        dct = {key: [float(v) for v in values.split()]
               for key, values in re.findall(r"(\w+)\s*=\s*(.*)\s*", weight)}
    
    print(dct)
    


    The resulting dictionary contents:

    {'UnknownWordPenalty0': [1.0], 'LexicalReordering0': [0.3, 0.3, 0.3, 0.3, 0.3, 0.3], 'LM0': [0.5], 'PhrasePenalty0': [0.2], 'TranslationModel0': [0.2, 0.2, 0.2, 0.2], 'Distortion0': [0.3], 'WordPenalty0': [-1.0]}
    

    The logic:

    • Get the [weight] block out of the file. This can be done with the regex r"\[weight][^\n]*(?:\n(?!$|\n)[^\n]*)*", which matches [weight] literally and then every following line until a blank line (two consecutive \n characters) or the end of the text. The regex uses the unrolled-loop technique, which copes well with longer texts spanning many lines. The lazy-dot-based alternative r"(?s)\[weight].*?(?:$|\n\n)" is definitely more readable but less efficient: finding the match in the current moses.ini file takes 62 steps with the first regex and 528 with the second.
    • Once you have run the search, check for a match. If one is found, run re.findall(r"(\w+)\s*=\s*(.*)\s*", weight) to collect all key-value pairs. The regex (\w+)\s*=\s*(.*)\s* captures one or more word characters into Group 1 ((\w+)), matches any amount of whitespace, =, and again any amount of whitespace (\s*=\s*), and then captures into Group 2 any characters other than a newline, up to the end of the line. Trailing newlines and any subsequent spaces are consumed by the final \s*.
    • When collecting the keys and values, the values can be returned as lists of floats using a list comprehension.
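
The three steps above can also be wrapped in a small reusable function. This is only a sketch under stated assumptions: the name parse_weights and the shortened SAMPLE_INI text are hypothetical, while the two regexes are the ones from the answer.

```python
import re

# Hypothetical, shortened stand-in for the contents of a moses.ini file.
SAMPLE_INI = """\
# mapping steps
[mapping]
0 T 0

# dense weights for feature functions
[weight]
UnknownWordPenalty0= 1
TranslationModel0= 0.2 0.2 0.2 0.2
LM0= 0.5
"""

# Unrolled-loop pattern from the answer: the [weight] line plus every
# following line, stopping before a blank line or the end of the text.
WEIGHT_BLOCK = re.compile(r"\[weight][^\n]*(?:\n(?!$|\n)[^\n]*)*")
# One "key = values" pair per line, as in the answer.
KEY_VALUE = re.compile(r"(\w+)\s*=\s*(.*)\s*")

def parse_weights(text):
    """Return {feature name: [float, ...]} for the [weight] section."""
    match = WEIGHT_BLOCK.search(text)
    if match is None:
        return {}  # no [weight] section found
    return {key: [float(v) for v in values.split()]
            for key, values in KEY_VALUE.findall(match.group())}

weights = parse_weights(SAMPLE_INI)
print(weights)
```

To apply it to a real file, read the file's text first (for example with open(path).read()) and pass that string to parse_weights.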