Search code examples
pythonpandasparsingtext-parsing

How do I get the position of characters in a string according to the previous line/string in Python?


I have to parse through a file that has data I need to put/sort in a pandas dataframe. Below is an example of part of file I parse through:

TEST#          RESULT    UNITS          LOWER          UPPER ALARM TEST NAME
------ -------------- -------- -------------- -------------- ----- ----------
1.0              1234                                              TESTNAME1
1.1           H647333                                              TESTNAME2
1.2                30        C                                     TEMPTOTAL
1.3                 1      cnt                                     CEREAL
1.4            364003      cnt                                     POINTNUM
1.5          20200505      cnt                                     Date
1.6            174143      cnt                                     Time
1.7     2.020051e+007      cnt                                     DateTime
1.8               123      cnt                                     SMT
1.9             23.16        C                                     TEMP1
1.10            23.55        C                           123       TEMP2
1.11            22.88        C            -23                      TEMP3
1.12            22.86        C                                     TEMP4
1.13            1.406    Meter         -1.450          1.500       DIST1
1.14            0.718    Meter         -0.800          0.350  FAIL DIST2

My issue is: How do I account for having a low limit but no upper limit OR having a upper limit but no low limit?

NOTE: My actual text file does not have this case but my application\project calls to account for the instance where it could happen.

How I check each line is below:

 line = file_object.readline()
 while line.strip():
    # extract data from line and format all info in one list
    xline = line.strip().split()

    # the length of the info list of the line read
    # is correlated to the data

    if len(xline) == 3:
        number = xline[0]
        results = xline[1]
        testname = xline[2]
        units = None
        lower = None
        upper = None
        # alarm = None

    elif len(xline) == 4:
        number = xline[0]
        results = xline[1]
        units = xline[2]
        testname = xline[3]
        lower = None
        upper = None
        # alarm = None

    elif len(xline) == 6:
        number = xline[0]
        results = xline[1]
        units = xline[2]
        lower = xline[3]
        upper = xline[4]
        testname = xline[5]
        # alarm = None

    elif len(xline) == 7:
        number = xline[0]
        results = xline[1]
        units = xline[2]
        lower = xline[3]
        upper = xline[4]
        # alarm = xline[5]
        testname = xline[6]

    # create a dictionary containing this row of data
    row = {
        'Test #': number,
        'Result': results,
        'Units': units,
        'Lower Limit': lower,
        'Upper Limit': upper,
        # 'Alarm': alarm,
        'Test Name': testname,
    }
    data.append(row)
    line = file_object.readline()

My idea is that I compare each line read of data to the "TEST# RESULT UNITS LOWER UPPER ALARM TEST NAME" line header positions, but I have no idea on how to do that. If anyone could point me in a direction that could work that would be great!

EDIT: The file is not solely in the table format shown above. My file has a whole bunch of staggered block text at the start of the file. As well as multiple "tables" with staggered block text between them.


Solution

  • A non-pandas solution that infers field widths from the header, but use pandas 🙂:

    import re
    
    with open('table.txt') as fin:
        next(fin) # skip headers
        # capture start/end of each set of dashed lines to get field widths
        spans = [m.span() for m in re.finditer(r'-+',next(fin))]
        for line in fin:
            # break lines on the field widths and strip leading/trailing white sapce
            column = [line[start:end].strip() for start,end in spans]
            print(column)
    

    Output:

    ['1.0', '1234', '', '', '', '', 'TESTNAME1']
    ['1.1', 'H647333', '', '', '', '', 'TESTNAME2']
    ['1.2', '30', 'C', '', '', '', 'TEMPTOTAL']
    ['1.3', '1', 'cnt', '', '', '', 'CEREAL']
    ['1.4', '364003', 'cnt', '', '', '', 'POINTNUM']
    ['1.5', '20200505', 'cnt', '', '', '', 'Date']
    ['1.6', '174143', 'cnt', '', '', '', 'Time']
    ['1.7', '2.020051e+007', 'cnt', '', '', '', 'DateTime']
    ['1.8', '123', 'cnt', '', '', '', 'SMT']
    ['1.9', '23.16', 'C', '', '', '', 'TEMP1']
    ['1.10', '23.55', 'C', '', '123', '', 'TEMP2']
    ['1.11', '22.88', 'C', '-23', '', '', 'TEMP3']
    ['1.12', '22.86', 'C', '', '', '', 'TEMP4']
    ['1.13', '1.406', 'Meter', '-1.450', '1.500', '', 'DIST1']
    ['1.14', '0.718', 'Meter', '-0.800', '0.350', 'FAIL', 'DIST2']