Parsing only some lines with pyparsing


I'm trying to parse a file, or rather just some portions of it. The file contains information about the hardware in a server, and each line starts with a keyword denoting the type of hardware. For example:

pci24 u2480-L0
fcs1 g4045-L1
pci25 h6045-L0
en192 v7024-L3
pci26 h6045-L1

The example above isn't a real file, but it's simple and sufficient to demonstrate the need. I only want to parse the lines starting with "pci" and skip the others. I wrote a grammar for lines starting with "pci":

grammar_pci = Group ( Word( "pci" + nums ) + Word( alphanums + "-" ) )

I've also written a grammar for lines not starting with "pci":

grammar_non_pci = Suppress( Regex( r"(?!pci)" ) )

And then I built a grammar that combines the two:

grammar = ( grammar_pci | grammar_non_pci )

Then I read the file and send it to parseString:

with open("foo.txt","r") as f:
  data = grammar.parseString(f.read())
print(data)

But no data is written as output. What am I missing? How can I parse the data while skipping the lines that don't start with a specific keyword?

Thanks.


Solution

  • You are off to a good start, but you are missing a few steps, mostly having to do with filling in gaps and repetition.

    First, look at your expression for grammar_non_pci:

    grammar_non_pci = Suppress( Regex( r"(?!pci)" ) )
    

    This correctly detects a line that does not start with "pci", but a negative lookahead consumes no characters, so the expression matches an empty string and doesn't actually parse the line's content.

    The easiest fix is to append ".*" to the regex, so that it matches not only the "not starting with pci" lookahead, but also the rest of the line.

    grammar_non_pci = Suppress( Regex( r"(?!pci).*" ) )
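
    To see the difference outside pyparsing, here is a small sketch using only Python's re module. The sample lines come from the question; the variable names are just for illustration. The bare lookahead succeeds but consumes nothing, while the version with ".*" consumes the whole line:

```python
import re

non_pci_line = "fcs1 g4045-L1"
pci_line = "pci24 u2480-L0"

# The bare lookahead only checks what comes next; it matches zero characters.
bare = re.match(r"(?!pci)", non_pci_line)

# Adding ".*" makes the pattern consume the rest of the line as well.
full = re.match(r"(?!pci).*", non_pci_line)

# On a line that does start with "pci", the lookahead fails and nothing matches.
rejected = re.match(r"(?!pci).*", pci_line)

print(repr(bare.group()))   # ''
print(repr(full.group()))   # 'fcs1 g4045-L1'
print(rejected)             # None
```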
    

    Second, your grammar just processes a single instance of an input line.

    grammar = ( grammar_pci | grammar_non_pci )
    

    The grammar needs to repeat, so that it is applied to every line of the input:

    grammar = OneOrMore( grammar_pci | grammar_non_pci, stopOn=StringEnd())
    
    [EDIT: since you are up to pyparsing 3.0.9, this can also be written as follows]
    grammar = (grammar_pci | grammar_non_pci)[1, ...: StringEnd()]
    

    Since grammar_non_pci could actually match on an empty string, it could repeat forever at the end of the file - that's why the stopOn argument is needed.

    With these changes, your sample text should parse correctly.
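
    Putting the pieces so far together, a minimal end-to-end sketch might look like this. It assumes pyparsing is installed, uses the sample text from the question as an inline string instead of reading foo.txt, and uses the Regex form of the "pci" word so that the match is restricted to the "pci" prefix:

```python
from pyparsing import Group, OneOrMore, Regex, StringEnd, Suppress, Word, alphanums

# Sample text from the question, inlined so the sketch is self-contained.
sample = """\
pci24 u2480-L0
fcs1 g4045-L1
pci25 h6045-L0
en192 v7024-L3
pci26 h6045-L1
"""

# A "pci" line: the keyword with digits, followed by the location code.
grammar_pci = Group(Regex(r"pci\d+") + Word(alphanums + "-"))

# Any other line: consume it entirely and drop it from the results.
grammar_non_pci = Suppress(Regex(r"(?!pci).*"))

# Repeat over all lines, stopping at the end of the input.
grammar = OneOrMore(grammar_pci | grammar_non_pci, stopOn=StringEnd())

result = grammar.parseString(sample).asList()
print(result)
# [['pci24', 'u2480-L0'], ['pci25', 'h6045-L0'], ['pci26', 'h6045-L1']]
```

    To parse a real file, pass f.read() to parseString in place of the inline sample.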

    But there is one issue that you'll need to clean up, and that is the definition of the "pci"-prefixed word in grammar_pci.

    grammar_pci = Group ( Word( "pci" + nums ) + Word( alphanums + "-" ) )
    

    Pyparsing's Word class takes 1 or 2 strings of characters, and uses them as a set of the valid characters for the initial word character and the body word characters. "pci" + nums gives the string "pci0123456789", and will match any word group using any of those characters. So it will match not only "pci00" but also "cip123", "cci123", "p0c0i", or "12345".
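
    A quick way to check this for yourself is to run the loose expression over the strings listed above. This is only an illustrative sketch (the names loose_word and samples are made up here), and it assumes pyparsing is installed:

```python
from pyparsing import Word, nums

# Word treats its argument as a set of allowed characters, not as a prefix.
loose_word = Word("pci" + nums)

samples = ["pci00", "cip123", "cci123", "p0c0i", "12345"]

# Every sample is built only from characters in "pci0123456789",
# so each one is accepted in full.
matched = [loose_word.parseString(text)[0] for text in samples]
print(matched)
```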

    To resolve this, use "pci" + Word(nums) wrapped in Combine to represent only word groups that start with "pci":

    grammar_pci = Group ( Combine("pci" + Word( nums )) + Word( alphanums + "-" ) )
    

    Since you seem comfortable using Regex items, you could also write this as

    grammar_pci = Group ( Regex(r"pci\d+") + Word( alphanums + "-" ) )
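
    As a rough check that both stricter forms behave as intended, the sketch below runs just the "pci"-prefixed word (without the Group and the second Word) against a few strings. The helper function matches and the sample list are hypothetical, added only for this illustration; pyparsing is assumed to be installed:

```python
from pyparsing import Combine, ParseException, Regex, Word, nums

# Two equivalent ways to require the literal prefix "pci" followed by digits.
strict_combine = Combine("pci" + Word(nums))
strict_regex = Regex(r"pci\d+")

def matches(expr, text):
    """Return True if expr can parse the start of text, else False."""
    try:
        expr.parseString(text)
        return True
    except ParseException:
        return False

samples = ["pci00", "cip123", "cci123", "p0c0i", "12345"]

combine_results = [matches(strict_combine, s) for s in samples]
regex_results = [matches(strict_regex, s) for s in samples]

print(combine_results)  # [True, False, False, False, False]
print(regex_results)    # [True, False, False, False, False]
```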
    

    These changes should get you moving forward on your parser.