Search code examples
javaparsingrecursionnlppseudocode

How to extract derivation rules from a bracketed parse tree?


I have a lot of parse trees like this:

( S ( NP-SBJ ( PRP I  )  )  ( INODE@S ( VP ( VBP have  )  ( NP ( DT a  )  ( INODE@NP ( NN savings  )  ( NN account  )  )  )  )  ( . .  )  )  )

for a sentence like this: "I have a savings account ."

I need to extract all derivation rules from these trees. The derivation rules like:

S -> NP-SBJ INODE@S
NP-SBJ -> PRP 
PRP -> I
INODE@S -> VP NP
and so on.

Is there any prepared code (preferably in java) or pseudo code for this purpose?

Edit:

I think this problem is very general and common in many areas. The simplified problem is to find each parent and it's children from a parenthesis tree.


Solution

  • I wrote this for python. I believe you can read it as a pseudocode. I will edit the post for Java later. I added the Java implementation later.

    import re
    
    # grammar repository 
    grammar_repo = []
    
    s = "( S ( NP-SBJ ( PRP I  )  )  ( INODE@S ( VP ( VBP have  )  ( NP ( DT a  )  ( INODE@NP ( NN savings  )  ( NN account  )  )  )  )  ( . . )  )  )"
    # clean the string (make sure there is no double space in it)
    s = s.replace("   ", " ").replace("  ", " ")
    
    # find inner parenthesis (no-need-to-parse-grammar) or (lhs rhs) format
    simple_grammars = re.findall("\([^\(\)]*\)", s)
    # repeat until you cannot find any ( lhs rhs ) grammar
    while len(simple_grammars) > 0:
        # add all simple grammar into grammar repository
        # replace them with their head
        for simple_grammar in simple_grammars:
            grammar = simple_grammar.split(" ")
            # '(' == grammar[0] and ')' == grammar[-1]
            lhs = grammar[1]
            rhs = grammar[2:-1]
            grammar_repo.append((lhs, rhs))
    
            s = s.replace(simple_grammar, lhs)
    
        simple_grammars = re.findall("\([^\(\)]*\)", s)
    

    In short, start from simplest grammar you can find and replace them with their left-hand-side and continue. e.g. find (PRP I) save it, then replace it with PRP. repeat until you find all the syntax.

    Update: The Java implementation is a bit different but it's the same idea. The full code is here: http://ideone.com/0eE8bd

    PrintStream ps = new PrintStream(System.out);
    ArrayList grammarRepo = new ArrayList();
    String item, grammar;
    String str = "( S ( NP-SBJ ( PRP I  )  )  ( INODE@S ( VP ( VBP have  )  ( NP ( DT a  )  ( INODE@NP ( NN savings  )  ( NN account  )  )  )  )  ( . . )  )  )";
    // cleanup double spaces
    while (str.contains("  ")){
        str = str.replaceAll("  ", " ");
    }
    // find no parenthesis zones!
    Matcher m = Pattern.compile("\\([^\\(\\)]*\\)").matcher(str);
    
    // loop until nothing left:
    while (m.find()) {
        item = m.group();
        // make the grammar:
        grammar = item.substring(1, item.length()-1).trim().replaceFirst(" ", " -> ");
    
        if (!grammarRepo.contains(grammar)) {
            grammarRepo.add(grammar);
            ps.print(grammar + "\n");
        }
    
        str = str.replace(item, grammar.split(" -> ")[0]);
        m = Pattern.compile("\\([^\\(\\)]*\\)").matcher(str);
    }
    

    the output:

    PRP -> I
    NP-SBJ -> PRP
    VBP -> have
    DT -> a
    NN -> savings
    NN -> account
    INODE@NP -> NN NN
    NP -> DT INODE@NP
    VP -> VBP NP
    . -> .
    INODE@S -> VP .
    S -> NP-SBJ INODE@S