
Collapse multiple lines when reading a flat file in Python


I want to parse a flat file in Python that looks like this:

  Element ID     Element Type     Result       Jacobian Sign    

============== ================= ========= =====================
      1            Parabolic      Warning          1.000000     
                  Hexahedron                                    
      2            Parabolic      Warning          1.000000     
                  Hexahedron                                    
      3            Parabolic      Warning          1.000000     
                  Hexahedron                                    
      4            Parabolic      Warning          1.000000     

I tried the approach used in this answer, as follows:

import pandas as pd

def parse_file(file):
    col_spec = [(0, 15), (16, 33), (34, 43), (44, 65)]
    return pd.read_fwf(file, colspecs=col_spec)

But this reads each physical line as its own record: one for the top row, and a second that is empty apart from the word 'Hexahedron' as the element type.

>>> data = parse_file("example.txt")
>>> data.head()
       Element ID      Element Type    Result         Jacobian Sign
0             NaN               NaN       NaN                   NaN
1  ==============  ================  ========  ====================
2               1         Parabolic   Warning              1.000000
3             NaN        Hexahedron       NaN                   NaN <= Extra record
4               2         Parabolic   Warning              1.000000

As you can see, the first two data lines are captured as two records (records 2 and 3). I want the parser to capture them as one record, so that the phrase 'Parabolic Hexahedron' is captured as the element type. How can I do this?


Solution

  • Some post-processing should do the trick. Here is some code using the shift method (Series.shift). Also note that you do not need to open the file yourself; just pass the filename to pd.read_fwf.

    import pandas as pd
    
    col_spec = [(0, 15), (15, 32), (32, 42), (43, 65)]
    df = pd.read_fwf("example.txt", colspecs=col_spec, comment="=")
    
    # Combine rows: join each row's element type with the value on the next line.
    # Continuation rows (no Element ID) keep their own value and are dropped below.
    df["combined"] = (df['Element Type'] + " " + df['Element Type'].shift(-1)).where(df['Element ID'].notnull(), df['Element Type'])
    # Remove the continuation rows, which have no Element ID.
    df = df[df['Element ID'].notnull()]
    

    This should give a DataFrame that looks like this:

      Element ID Element Type   Result Jacobian Sign              combined
    2          1    Parabolic  Warning      1.000000  Parabolic Hexahedron
    4          2    Parabolic  Warning      1.000000  Parabolic Hexahedron
    6          3    Parabolic  Warning      1.000000  Parabolic Hexahedron
    8          4    Parabolic  Warning      1.000000  Parabolic Hexahedron
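
The shift approach above pairs each row with exactly one following line. If a record can wrap onto a variable number of continuation lines (including none), one possible alternative is to forward-fill the Element ID and join the fragments per record. The following is a minimal sketch, not a definitive implementation: the helper names (`_row`, `parse_wrapped`), the inlined sample data, and the use of `skiprows=2` with explicit column names instead of `comment="="` are all assumptions made for illustration.

```python
import io
import pandas as pd

COLS = ["Element ID", "Element Type", "Result", "Jacobian Sign"]
COL_SPEC = [(0, 15), (15, 32), (32, 42), (43, 65)]  # same spec as above


def _row(eid, etype, result, jac):
    # Hypothetical helper: builds one fixed-width line matching the sample layout.
    return f"{eid:^14} {etype:^17} {result:^9} {jac:^21}\n"


# Inlined stand-in for example.txt; record 4 has no continuation line.
SAMPLE = (
    _row(*COLS)
    + _row("=" * 14, "=" * 17, "=" * 9, "=" * 21)
    + _row("1", "Parabolic", "Warning", "1.000000")
    + _row("", "Hexahedron", "", "")
    + _row("2", "Parabolic", "Warning", "1.000000")
    + _row("", "Hexahedron", "", "")
    + _row("3", "Parabolic", "Warning", "1.000000")
    + _row("", "Hexahedron", "", "")
    + _row("4", "Parabolic", "Warning", "1.000000")
)


def parse_wrapped(file, key="Element ID", text_col="Element Type"):
    # Skip the header and separator lines and supply the column names explicitly.
    df = pd.read_fwf(file, colspecs=COL_SPEC, skiprows=2, header=None, names=COLS)
    # Continuation lines have no key; forward-fill so they inherit the key
    # of the record they belong to.
    record_id = df[key].ffill()
    # Join all text fragments of each record with a single space.
    joined = df.groupby(record_id)[text_col].agg(lambda s: " ".join(s.dropna()))
    # Keep only the first physical line of each record and overwrite its text.
    out = df[df[key].notna()].copy()
    out[text_col] = out[key].map(joined)
    out[key] = out[key].astype(int)
    return out.reset_index(drop=True)


result = parse_wrapped(io.StringIO(SAMPLE))
print(result)
```

With a real file, `parse_wrapped("example.txt")` would be the equivalent call, assuming the file has exactly one header line and one separator line before the data.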