Grouping CFG grammar rules sentence-wise


The rules below are generated for each sentence, and I need to group them per sentence. The input is in a file, and the output should also be written to a file.

sentenceid=2

NP--->N_NNP
NP--->N_NN_S_NU
NP--->N_NNP
NP--->N_NNP
NP--->N_NN_O_NU
VGF--->V_VM_VF

sentenceid=3

NP--->N_NN
VGNF--->V_VM_VNF
JJP--->JJ
NP--->N_NN_S_NU
NP--->N_NN
VGF--->V_VM_VF

sentenceid=4

NP--->N_NNP
NP--->N_NN_S_NU
NP--->N_NNP_O_M
VGF--->V_VM_VF

The section above is the input, which is the grammar for each sentence. I want to group adjacent rules that share the same left-hand side, sentence by sentence. The output should look like this:

sentenceid=2

NP--->N_NNP N_NN_S_NU N_NNP N_NNP N_NN_O_NU
VGF--->V_VM_VF

sentenceid=3

NP--->N_NN
VGNF--->V_VM_VNF
JJP--->JJ
NP--->N_NN_S_NU N_NN
VGF--->V_VM_VF

sentenceid=4

NP--->N_NNP N_NN_S_NU N_NNP_O_M
VGF--->V_VM_VF

How can I implement this? I need the rules for almost 1000 sentences for a probability calculation.


Solution

  • How about this? It reads the rules from an input file (named `sentence` here).

    #!/usr/bin/env python3

    import re
    from collections import OrderedDict

    MARKER = '--->'

    def parse_it(path):
        """Group adjacent rules that share a left-hand side, per sentence."""
        total_dic = OrderedDict()  # sentence id -> list of groups, in file order
        marker_memory = None       # current sentence id
        mem = None                 # left-hand side of the current group
        lo = []                    # (lhs, rhs) pairs of the current group
        with open(path) as fh:
            for line in fh:
                line = line.strip()
                if not line:
                    continue
                match = re.search(r'sentenceid=\d+', line)
                if match:
                    # Flush the last group of the previous sentence, then reset.
                    if lo:
                        total_dic[marker_memory].append(lo)
                    marker_memory = match.group(0)
                    total_dic[marker_memory] = []
                    mem, lo = None, []
                else:
                    k, v = [x.strip() for x in line.split(MARKER)]
                    if mem is not None and mem != k:
                        # Left-hand side changed: close the current group.
                        total_dic[marker_memory].append(lo)
                        lo = []
                    lo.append((k, v))
                    mem = k
        # Flush the final group of the file.
        if lo:
            total_dic[marker_memory].append(lo)
        return total_dic

    dic = parse_it('sentence')
    for kin, lol in dic.items():
        print()
        print(kin)
        for group in lol:
            k, v = zip(*group)
            print('%s%s%s' % (k[0], MARKER, ' '.join(v)))


    Output:

    sentenceid=2
    NP--->N_NNP N_NN_S_NU N_NNP N_NNP N_NN_O_NU
    VGF--->V_VM_VF

    sentenceid=3
    NP--->N_NN
    VGNF--->V_VM_VNF
    JJP--->JJ
    NP--->N_NN_S_NU N_NN
    VGF--->V_VM_VF

    sentenceid=4
    NP--->N_NNP N_NN_S_NU N_NNP_O_M
    VGF--->V_VM_VF
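
Since the question asks for the result to be written to a file, the same grouping could also be sketched with `itertools.groupby`, which merges only consecutive items with equal keys. This is a minimal sketch, not a drop-in replacement: the function names `group_rules`, `format_groups` and `group_file` and the file paths are hypothetical, and it assumes every rule line has the form `LHS--->RHS` and follows a `sentenceid=N` line.

```python
import re
from itertools import groupby

MARKER = '--->'

def group_rules(lines):
    """Map each sentence id to [(lhs, [rhs, ...]), ...], merging adjacent same-LHS rules."""
    sentences = {}   # plain dicts keep insertion order in Python 3.7+
    current = None   # sentence id the following rules belong to
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if re.fullmatch(r'sentenceid=\d+', line):
            current = line
            sentences[current] = []
        elif current is not None and MARKER in line:
            lhs, rhs = (part.strip() for part in line.split(MARKER, 1))
            sentences[current].append((lhs, rhs))
    # groupby starts a new run whenever the left-hand side changes,
    # so non-adjacent rules with the same LHS stay separate.
    return {
        sid: [(lhs, [rhs for _, rhs in run])
              for lhs, run in groupby(rules, key=lambda rule: rule[0])]
        for sid, rules in sentences.items()
    }

def format_groups(grouped):
    """Render the grouped rules in the layout shown in the question."""
    out = []
    for sid, groups in grouped.items():
        out.extend([sid, ''])
        out.extend(lhs + MARKER + ' '.join(rhss) for lhs, rhss in groups)
        out.append('')
    return '\n'.join(out)

def group_file(in_path, out_path):
    """Read rules from in_path and write the grouped rules to out_path."""
    with open(in_path) as src:
        grouped = group_rules(src)
    with open(out_path, 'w') as dst:
        dst.write(format_groups(grouped))
```

Under these assumptions, `group_file('sentence', 'grouped.txt')` would read the input file used above and write the grouped rules to `grouped.txt` (an illustrative name).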