python python-3.x parsing nlp text-parsing

Text file parsing with Python, with a list in the grammar


I need to parse some text: the goal is to create grammar rules that will be applied to a corpus. My question is: is it possible to have a list within a grammar?

Example:

1) Open the text to be analyzed
2) Write the grammatical rules (just an example):
   grammar("""
   S -> NP VP
   NP -> DET N
   VP -> V N
   DET -> list_det.txt
   N -> list_n.txt
   V -> list.txt""")
3) Print the result with the entries that obey this grammar

Is this possible?


Solution

  • Here is a quick conceptual prototype of your grammar, using pyparsing. I could not tell from your question what the N, V, and DET lists would contain, so I arbitrarily chose words composed of 'n's and 'v's, and the literal 'det'. You can replace the <<= assignments with the correct expressions for your grammar, but this parser and the sample string should show that your grammar is at least feasible. (If you edit your question to show what N, V, and DET are lists of, I can update this answer with less arbitrary expressions and a less arbitrary sample. Including a sample string to be parsed would also be useful.)

    I also added some grouping so that you can see how the structure of the grammar is reflected in the structure of the results. You can leave this in or remove it; the parser will work either way.

    import pyparsing as pp
    
    v = pp.Forward()
    n = pp.Forward()
    det = pp.Forward()
    
    V = pp.Group(pp.OneOrMore(v))
    N = pp.Group(pp.OneOrMore(n))
    DET = pp.Group(pp.OneOrMore(det))
    
    VP = pp.Group(V + N)
    NP = pp.Group(DET + N)
    S = NP + VP
    
    # replace these with something meaningful
    v <<= pp.Word('v')
    n <<= pp.Word('n')
    det <<= pp.Literal('det')
    
    sample = 'det det nn nn nn nn vv vv vv nn nn nn nn'
    
    parsed = S.parseString(sample)
    print(parsed.asList())
    

    Prints:

    [[['det', 'det'], ['nn', 'nn', 'nn', 'nn']], 
     [['vv', 'vv', 'vv'], ['nn', 'nn', 'nn', 'nn']]]
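
The question asks whether DET, N, and V can be taken from files such as list_det.txt. One possible way to do that with the prototype above is sketched here, assuming each file holds one word per line (the question does not say). `terminal_from_file` is a hypothetical helper, not part of pyparsing, and the file contents are made up for the demo.

```python
import tempfile
from pathlib import Path

import pyparsing as pp


def terminal_from_file(path):
    """Hypothetical helper: match any one word listed in `path`.

    Assumes one word per line; blank lines are ignored.
    """
    text = Path(path).read_text(encoding="utf-8")
    words = [line.strip() for line in text.splitlines() if line.strip()]
    # Keyword (rather than Literal) so that 'a' cannot match the start of 'apple'
    return pp.MatchFirst([pp.Keyword(word) for word in words])


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    # Made-up contents for the three files named in the question
    (tmp / "list_det.txt").write_text("the\na\n", encoding="utf-8")
    (tmp / "list_n.txt").write_text("dog\ncat\nfood\n", encoding="utf-8")
    (tmp / "list.txt").write_text("eats\nlikes\n", encoding="utf-8")

    DET = terminal_from_file(tmp / "list_det.txt")
    N = terminal_from_file(tmp / "list_n.txt")
    V = terminal_from_file(tmp / "list.txt")

# The question's rules: S -> NP VP, NP -> DET N, VP -> V N
NP = pp.Group(DET + N)
VP = pp.Group(V + N)
S = NP + VP

result = S.parseString("the dog likes food").asList()
print(result)
```

The files are only read while the expressions are being built, so the lists can be edited without touching the grammar rules themselves.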
    

    EDIT:

    I guessed that "NP" and "VP" are "noun phrase" and "verb phrase", but I don't know what "DET" could be. Still, I made up a less abstract example. I also expanded the lists to accept more grammatical forms of lists of nouns and verbs, with connecting 'and's, 'or's, and commas.

    import pyparsing as pp
    
    v = pp.Forward()
    n = pp.Forward()
    det = pp.Forward()
    
    def collectionOf(expr):
        '''
        Compose a collection expression for a base expression that matches
            expr
            expr and expr
            expr or expr
            expr, expr, expr, and expr
        ('or' may be used in place of 'and' in the comma-separated form)
        '''
        AND = pp.Literal('and')
        OR = pp.Literal('or')
        COMMA = pp.Suppress(',')
        return expr + pp.Optional(
                pp.Optional(pp.OneOrMore(COMMA + expr) + COMMA) + (AND | OR) + expr)
    
    V = pp.Group(collectionOf(v))('V')
    N = pp.Group(collectionOf(n))('N')
    DET = pp.Group(pp.OneOrMore(det))('DET')
    
    VP = pp.Group(V + N)('VP')
    NP = pp.Group(DET + N)('NP')
    S = pp.Group(NP + VP)('S')
    
    # replace these with something meaningful
    v <<= pp.Combine(pp.oneOf('chase love hate like eat drink') + pp.Optional(pp.Literal('s')))
    n <<= pp.Optional(pp.oneOf('the a my your our his her their')) + pp.oneOf("dog cat horse rabbit squirrel food water")
    det <<= pp.Optional(pp.oneOf('why how when where')) + pp.oneOf('do does did')
    
    samples = '''
        when does the dog eat the food
        does the dog like the cat
        do the horse, cat, and dog like or hate their food
        do the horse and dog love the cat
        why did the dog chase the squirrel
    '''
    S.runTests(samples)
    

    Prints:

    when does the dog eat the food
    [[[['when', 'does'], ['the', 'dog']], [['eat'], ['the', 'food']]]]
    - S: [[['when', 'does'], ['the', 'dog']], [['eat'], ['the', 'food']]]
      - NP: [['when', 'does'], ['the', 'dog']]
        - DET: ['when', 'does']
        - N: ['the', 'dog']
      - VP: [['eat'], ['the', 'food']]
        - N: ['the', 'food']
        - V: ['eat']
    
    
    does the dog like the cat
    [[[['does'], ['the', 'dog']], [['like'], ['the', 'cat']]]]
    - S: [[['does'], ['the', 'dog']], [['like'], ['the', 'cat']]]
      - NP: [['does'], ['the', 'dog']]
        - DET: ['does']
        - N: ['the', 'dog']
      - VP: [['like'], ['the', 'cat']]
        - N: ['the', 'cat']
        - V: ['like']
    
    
    do the horse, cat, and dog like or hate their food
    [[[['do'], ['the', 'horse', 'cat', 'and', 'dog']], [['like', 'or', 'hate'], ['their', 'food']]]]
    - S: [[['do'], ['the', 'horse', 'cat', 'and', 'dog']], [['like', 'or', 'hate'], ['their', 'food']]]
      - NP: [['do'], ['the', 'horse', 'cat', 'and', 'dog']]
        - DET: ['do']
        - N: ['the', 'horse', 'cat', 'and', 'dog']
      - VP: [['like', 'or', 'hate'], ['their', 'food']]
        - N: ['their', 'food']
        - V: ['like', 'or', 'hate']
    
    
    do the horse and dog love the cat
    [[[['do'], ['the', 'horse', 'and', 'dog']], [['love'], ['the', 'cat']]]]
    - S: [[['do'], ['the', 'horse', 'and', 'dog']], [['love'], ['the', 'cat']]]
      - NP: [['do'], ['the', 'horse', 'and', 'dog']]
        - DET: ['do']
        - N: ['the', 'horse', 'and', 'dog']
      - VP: [['love'], ['the', 'cat']]
        - N: ['the', 'cat']
        - V: ['love']
    
    
    why did the dog chase the squirrel
    [[[['why', 'did'], ['the', 'dog']], [['chase'], ['the', 'squirrel']]]]
    - S: [[['why', 'did'], ['the', 'dog']], [['chase'], ['the', 'squirrel']]]
      - NP: [['why', 'did'], ['the', 'dog']]
        - DET: ['why', 'did']
        - N: ['the', 'dog']
      - VP: [['chase'], ['the', 'squirrel']]
        - N: ['the', 'squirrel']
        - V: ['chase']
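
For step 3 of the question (printing only the corpus entries that obey the grammar), one option is to parse the corpus line by line with parseAll=True and keep the lines that parse without raising a ParseException. This is a minimal sketch, assuming one sentence per line; the word lists, the corpus, and the `matching_lines` helper are invented for illustration.

```python
import pyparsing as pp

# Made-up word lists; replace with your own (or load them from files)
DET = pp.oneOf("the a my")
N = pp.oneOf("dog cat food")
V = pp.oneOf("eats likes chases")

# The question's rules: S -> NP VP, NP -> DET N, VP -> V N
S = pp.Group(DET + N)("NP") + pp.Group(V + N)("VP")


def matching_lines(grammar, lines):
    """Hypothetical helper: return the lines that the grammar matches in full."""
    matches = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            # parseAll=True rejects lines with leftover, unparsed text
            grammar.parseString(line, parseAll=True)
        except pp.ParseException:
            continue
        matches.append(line)
    return matches


corpus = """
the dog likes food
dog the likes food
a cat chases dog
my cat eats
the cat eats food quickly
"""

matches = matching_lines(S, corpus.splitlines())
for line in matches:
    print(line)
```

Without parseAll=True, "the cat eats food quickly" would also be accepted, because pyparsing stops at the end of the first match and ignores the trailing text by default.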