Parse a colon-separated string with pyparsing


This is the data:

C:/data/my_file.txt.c:10:0x21:name1:name2:0x10:1:OK
C:/data/my_file2.txt.c:110:0x1:name2:name5:0x12:1:NOT_OK
./data/my_file3.txt.c:110:0x1:name2:name5:0x12:10:OK

And I would like to get this result:

[C:/data/my_file.txt.c, 10, 0x21, name1, name2, 0x10, 1, OK]
[C:/data/my_file2.txt.c, 110, 0x1, name2, name5, 0x12, 1, NOT_OK]
[./data/my_file3.txt.c, 110, 0x1, name2, name5, 0x12, 10, OK]

I know how to do this with str.split and similar code, but I am looking for a clean solution using pyparsing. My problem is the ":/" in the file path, because it contains the same colon that separates the fields.

Additional question: I use some code to strip comments and other noise from the records. The raw data looks like this:

text = """C:/data/my_file.txt.c:10:0x21:name1:name2:0x10:1:OK
C:/data/my_file2.txt.c:110:0x1:name2:name5:0x12:1:NOT_OK
// comment
./data/my_file3.txt.c:110:0x1:name2:name5:0x12:10:OK
---- 
ok
"""

Right now I strip the "//", "ok", and "---" lines before parsing.
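The pre-filtering step described above could look roughly like the following. This is only a sketch in plain Python: `keep_record_lines` is a hypothetical helper name, and the skip rules (blank lines, "//" comments, dashed lines, a bare "ok") are inferred from the sample text rather than taken from the real files.

```python
def keep_record_lines(text):
    """Return the lines that look like records, dropping the noise lines."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        # Assumed noise: blank lines, "//" comments, dashed lines, a bare "ok".
        if (not stripped
                or stripped.startswith("//")
                or stripped.startswith("---")
                or stripped == "ok"):
            continue
        kept.append(stripped)
    return kept


text = """C:/data/my_file.txt.c:10:0x21:name1:name2:0x10:1:OK
C:/data/my_file2.txt.c:110:0x1:name2:name5:0x12:1:NOT_OK
// comment
./data/my_file3.txt.c:110:0x1:name2:name5:0x12:10:OK
---- 
ok
"""

records = keep_record_lines(text)
for record in records:
    print(record)
```

Each surviving line can then be handed to the parser one at a time.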

I now have a follow-up to the first question.

Until now I extracted the lines above from a data file, reading it line by line and parsing each line, and that works well. I have since found out that parseFile can parse a whole file, so I could drop some of my code and use parseFile instead. The files I would like to parse have an additional footer:

C:/data/my_file.txt.c:10:0x21:name1:name2:0x10:1:OK
C:/data/my_file2.txt.c:110:0x1:name2:name5:0x12:1:NOT_OK
./data/my_file3.txt.c:110:0x1:name2:name5:0x12:10:OK: info message

-----------------------
3 Files 2 OK 1 NOT_OK
NOT_OK 

Is it possible to change the parser to get 2 parse results?

Result 1:

[['C:/data/my_file.txt.c', '10', '0x21', 'name1', 'name2', '0x10', '1', 'OK'],
 ['C:/data/my_file2.txt.c', '110', '0x1', 'name2', 'name5', '0x12', '1', 'NOT_OK'],
 ['./data/my_file3.txt.c', '110', '0x1', 'name2', 'name5', '0x12', '10', 'OK']]

Ignore the blank line   
Ignore this line => -----------------------

Result 2:

[['3', 'Files', '2', 'OK', '1', 'NOT_OK'],
 ['NOT_OK']]

So I changed the code to this:

from pyparsing import *

# define an expression for your file reference
one_thing = Combine(
    oneOf(list(alphas)) + ':/' +
    Word(alphanums + '_-./'))

# define a catchall expression for everything else (words of non-whitespace characters,
# excluding ':')
another_thing = Word(printables + " ", excludeChars=':')

# define an expression of the two; be sure to list the file reference first
thing = one_thing | another_thing

# now use plain old pyparsing delimitedList, with ':' delimiter
list_of_things = delimitedList(thing, delim=':')

list_of_other_things = Word(printables).setName('a')
# run it and see...
parse_ret = OneOrMore(Group(list_of_things | list_of_other_things)).parseFile("data.file")
parse_ret.pprint()

And I get this result:

[['C:/data/my_file.txt.c', '10', '0x21', 'name1', 'name2', '0x10', '1', 'OK'],
['C:/data/my_file2.txt.c', '110', '0x1', 'name2', 'name5', '0x12', '1', 'NOT_OK'],
['./data/my_file3.txt.c', '110', '0x1', 'name2', 'name5', '0x12', '10', 'OK', 'info message'],
['-----------------------'],
['3 Files 2 OK 1 NOT_OK'],
['NOT_OK']]

I can work with this, but is it possible to split the result into two named results? I searched the docs but didn't find anything that works.


Solution

  • I didn't find a solution with delimitedList and parseFile, but I found a solution that is good enough for me.

    from pyparsing import *
    
    data = """
    C: / data / my_file.txt.c:10:0x21:name1:name2:0x10:1:OK
    C: / data / my_file2.txt.c:110:0x1:name2:name5:0x12:1:NOT_OK
    ./ data / my_file3.txt.c:110:0x1:name2:name5:0x12:10:OK: info message
    
    -----------------------
    3 Files 2 OK 1 NOT_OK
    NOT_OK
    """
    
    if __name__ == '__main__':
        # file reference such as "C:/data/my_file.txt.c": drive letter, ':/', then the path
        entry_one = Combine(
            oneOf(list(alphas)) + ':/' +
            Word(alphanums + '_-./'))

        # catch-all for every other field: printable characters and spaces, excluding ':'
        entry_two = Word(printables + ' ', excludeChars=':')
        entry = entry_one | entry_two

        # one record line: colon-separated fields with an optional trailing message
        delimiter = Literal(':').suppress()
        tc_result_line = Group(
            entry.setResultsName('file_name') + delimiter +
            entry.setResultsName('line_nr') + delimiter +
            entry.setResultsName('num_one') + delimiter +
            entry.setResultsName('name_one') + delimiter +
            entry.setResultsName('name_two') + delimiter +
            entry.setResultsName('num_two') + delimiter +
            entry.setResultsName('status') +
            Optional(delimiter + entry.setResultsName('msg'))
        ).setResultsName('info_line')

        EOL = LineEnd().suppress()
        SOL = LineStart().suppress()
        blank_line = SOL + EOL

        # footer: the summary counts, then the overall result
        tc_summary_line = Group(
            Word(nums).setResultsName('num_of_lines') + 'Files' +
            Word(nums).setResultsName('num_of_ok') + 'OK' +
            Word(nums).setResultsName('num_of_not_ok') + 'NOT_OK'
        ).setResultsName('info_summary')
        # Or takes its alternatives as one list; Group keeps the match as a one-element list
        tc_end_line = Group(Or([Literal('NOT_OK'), Literal('Ok')])).setResultsName('info_result')

        # try a record line first, then the footer lines; skip blank and dashed lines
        pp1 = tc_result_line | Optional(tc_summary_line | tc_end_line)
        pp1.ignore(blank_line | OneOrMore('-'))

        # parse line by line and keep only the non-empty results
        result = []
        for line in data.split('\n'):
            parsed = pp1.parseString(line).asDict()
            if parsed:
                result.append(parsed)

        for r in result:
            print(r)
    

    Result:

    {'info_line': {'file_name': 'C', 'num_one': '10', 'msg': '1', 'name_one':   '0x21', 'line_nr': '/ data / my_file.txt.c', 'status': '0x10', 'num_two': 'name2', 'name_two': 'name1'}}
    {'info_line': {'file_name': 'C', 'num_one': '110', 'msg': '1', 'name_one': '0x1', 'line_nr': '/ data / my_file2.txt.c', 'status': '0x12', 'num_two': 'name5', 'name_two': 'name2'}}
    {'info_line': {'file_name': './ data / my_file3.txt.c', 'num_one': '0x1', 'msg': 'OK', 'name_one': 'name2', 'line_nr': '110', 'status': '10', 'num_two': '0x12', 'name_two': 'name5'}}
    {'info_summary': {'num_of_lines': '3', 'num_of_ok': '2', 'num_of_not_ok': '1'}}
    {'info_result': ['NOT_OK']}
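In the output above the record fields are shifted: 'file_name' is 'C' and the path ends up in 'line_nr'. Two things seem to cause this. The sample paths contain spaces around ":/", so the file-reference expression never matches, and the grammar names seven fields plus 'msg' while each record has eight fields plus an optional message. The following is a minimal sketch of how the whole text might be parsed in one call with named results, assuming the paths are written without spaces as in the question. The field name 'count' and the result name 'info_lines' are made up for illustration, and the footer layout is taken from the single sample shown.

```python
from pyparsing import (
    Combine, Group, Literal, OneOrMore, Optional, Suppress, Word,
    alphanums, alphas, nums, printables,
)

text = """C:/data/my_file.txt.c:10:0x21:name1:name2:0x10:1:OK
C:/data/my_file2.txt.c:110:0x1:name2:name5:0x12:1:NOT_OK
./data/my_file3.txt.c:110:0x1:name2:name5:0x12:10:OK: info message

-----------------------
3 Files 2 OK 1 NOT_OK
NOT_OK
"""

colon = Suppress(":")
# Drive-letter path such as "C:/data/x.c"; its ':/' must not act as a delimiter.
drive_path = Combine(Word(alphas, exact=1) + ":/" + Word(alphanums + "_-./"))
# Any other field: printable characters and spaces, but no ':' and no newline.
field = Word(printables + " ", excludeChars=":")

# Eight colon-separated fields plus an optional message.
# 'count' is a hypothetical name for the seventh field.
record = Group(
    (drive_path | field).setResultsName("file_name") + colon
    + field.setResultsName("line_nr") + colon
    + field.setResultsName("num_one") + colon
    + field.setResultsName("name_one") + colon
    + field.setResultsName("name_two") + colon
    + field.setResultsName("num_two") + colon
    + field.setResultsName("count") + colon
    + field.setResultsName("status")
    + Optional(colon + field.setResultsName("msg"))
)

# Footer: dashed separator (dropped), summary counts, overall result.
separator = Suppress(Word("-"))
summary = Group(
    Word(nums).setResultsName("num_of_lines") + "Files"
    + Word(nums).setResultsName("num_of_ok") + "OK"
    + Word(nums).setResultsName("num_of_not_ok") + "NOT_OK"
)
verdict = Literal("NOT_OK") | Literal("OK")

report = (
    Group(OneOrMore(record)).setResultsName("info_lines")
    + separator
    + summary.setResultsName("info_summary")
    + verdict.setResultsName("info_result")
)

result = report.parseString(text, parseAll=True)
for rec in result["info_lines"]:
    print(rec["file_name"], rec["status"], rec.get("msg", ""))
print(result["info_summary"].asList())
print(result["info_result"])
```

Because every record must contain the colon delimiters, the dashed line and the summary line cannot be mistaken for records, so the record list ends where the footer begins. The same `report` expression should also work with parseFile on a file that has this layout.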