I have a lot of parse trees like this:
( S ( NP-SBJ ( PRP I ) ) ( INODE@S ( VP ( VBP have ) ( NP ( DT a ) ( INODE@NP ( NN savings ) ( NN account ) ) ) ) ( . . ) ) )
for a sentence like this: "I have a savings account ."
I need to extract all derivation rules from these trees. The derivation rules like:
S -> NP-SBJ INODE@S
NP-SBJ -> PRP
PRP -> I
INODE@S -> VP NP
and so on.
Is there any prepared code (preferably in java) or pseudo code for this purpose?
Edit:
I think this problem is very general and common in many areas. The simplified problem is to find each parent and it's children from a parenthesis tree.
I wrote this for python. I believe you can read it as a pseudocode. I will edit the post for Java later. I added the Java implementation later.
import re
# grammar repository
grammar_repo = []
s = "( S ( NP-SBJ ( PRP I ) ) ( INODE@S ( VP ( VBP have ) ( NP ( DT a ) ( INODE@NP ( NN savings ) ( NN account ) ) ) ) ( . . ) ) )"
# clean the string (make sure there is no double space in it)
s = s.replace(" ", " ").replace(" ", " ")
# find inner parenthesis (no-need-to-parse-grammar) or (lhs rhs) format
simple_grammars = re.findall("\([^\(\)]*\)", s)
# repeat until you cannot find any ( lhs rhs ) grammar
while len(simple_grammars) > 0:
# add all simple grammar into grammar repository
# replace them with their head
for simple_grammar in simple_grammars:
grammar = simple_grammar.split(" ")
# '(' == grammar[0] and ')' == grammar[-1]
lhs = grammar[1]
rhs = grammar[2:-1]
grammar_repo.append((lhs, rhs))
s = s.replace(simple_grammar, lhs)
simple_grammars = re.findall("\([^\(\)]*\)", s)
In short, start from simplest grammar you can find and replace them with their left-hand-side and continue. e.g. find (PRP I)
save it, then replace it with PRP
. repeat until you find all the syntax.
Update: The Java implementation is a bit different but it's the same idea. The full code is here: http://ideone.com/0eE8bd
PrintStream ps = new PrintStream(System.out);
ArrayList grammarRepo = new ArrayList();
String item, grammar;
String str = "( S ( NP-SBJ ( PRP I ) ) ( INODE@S ( VP ( VBP have ) ( NP ( DT a ) ( INODE@NP ( NN savings ) ( NN account ) ) ) ) ( . . ) ) )";
// cleanup double spaces
while (str.contains(" ")){
str = str.replaceAll(" ", " ");
}
// find no parenthesis zones!
Matcher m = Pattern.compile("\\([^\\(\\)]*\\)").matcher(str);
// loop until nothing left:
while (m.find()) {
item = m.group();
// make the grammar:
grammar = item.substring(1, item.length()-1).trim().replaceFirst(" ", " -> ");
if (!grammarRepo.contains(grammar)) {
grammarRepo.add(grammar);
ps.print(grammar + "\n");
}
str = str.replace(item, grammar.split(" -> ")[0]);
m = Pattern.compile("\\([^\\(\\)]*\\)").matcher(str);
}
the output:
PRP -> I
NP-SBJ -> PRP
VBP -> have
DT -> a
NN -> savings
NN -> account
INODE@NP -> NN NN
NP -> DT INODE@NP
VP -> VBP NP
. -> .
INODE@S -> VP .
S -> NP-SBJ INODE@S