Parsing an irregular .dat file in Python


I have a .dat file of coordinates (x, y and z) in which groups of points are separated by marker lines, each containing a single integer. Here's a snippet of it:

500
0.14166    0.09077      0
0.11918    0.08461      0
0.09838    0.07771      0
0.07937    0.07022      0
0.06223    0.06222      0
0.04705    0.05386      0
0.03388    0.04528      0
0.02281    0.03663      0
0.01391    0.02808      0
42
0.00733    0.01969      0
0.00297    0.01152      0
0.01809    -0.01422     0
0.03068    -0.01687     0
0.14166    0.09077      0
0.11918    0.08461      0
0.09838    0.07771      0
0.07937    0.07022      0
42
0.14166    0.09077      0
0.11918    0.08461      0
0.09838    0.07771      0
0.07937    0.07022      0

What's the best way to split it into chunks (preferably one array per interval between markers)?

This is just a fraction of the data; in reality there are a few thousand points.


Solution

  • I would suggest using the pandas and numpy libraries.

    Start by loading the file into a DataFrame, skipping the first row (skiprows=1) and fixing the number of columns through the column names (names=['x','y','z']). Marker lines are then read as rows with a value in x only and NaN in the other columns (for example 42.00000 NaN NaN):

    import pandas as pd
    import numpy as np
    
    # sep=r'\s+' splits on runs of whitespace (delim_whitespace is deprecated in recent pandas)
    coords = pd.read_table('test.dat', sep=r'\s+', header=None,
                           engine='python', skiprows=1, names=['x','y','z'])
    

    Then find the positions of the marker lines, at which the coords DataFrame will be split into chunks:

    na_markers = coords.loc[coords['y'].isna()].index
    

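    Finally, split at those positions and convert each chunk to a numpy array. Slicing with iloc is used here because np.split on a DataFrame raises a deprecation warning in recent pandas versions:

    bounds = [0, *na_markers, len(coords)]
    coords = [coords.iloc[a:b].dropna().to_numpy() for a, b in zip(bounds, bounds[1:])]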
    

    That's it: coords is now a list of the coordinate chunks, one array per interval between markers:

    [array([[0.14166, 0.09077, 0.     ],
           [0.11918, 0.08461, 0.     ],
           [0.09838, 0.07771, 0.     ],
           [0.07937, 0.07022, 0.     ],
           [0.06223, 0.06222, 0.     ],
           [0.04705, 0.05386, 0.     ],
           [0.03388, 0.04528, 0.     ],
           [0.02281, 0.03663, 0.     ],
           [0.01391, 0.02808, 0.     ]]), array([[ 0.00733,  0.01969,  0.     ],
           [ 0.00297,  0.01152,  0.     ],
           [ 0.01809, -0.01422,  0.     ],
           [ 0.03068, -0.01687,  0.     ],
           [ 0.14166,  0.09077,  0.     ],
           [ 0.11918,  0.08461,  0.     ],
           [ 0.09838,  0.07771,  0.     ],
           [ 0.07937,  0.07022,  0.     ]]), array([[0.14166, 0.09077, 0.     ],
           [0.11918, 0.08461, 0.     ],
           [0.09838, 0.07771, 0.     ],
           [0.07937, 0.07022, 0.     ]])]
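
As a cross-check, the same splitting rule (a line with a single field starts a new chunk) can be sketched with the standard library only. This is an alternative to the pandas approach above, not part of it. The helper name split_by_markers and the shortened SAMPLE text are assumptions made for illustration. Unlike the pandas version, this sketch also keeps the marker value of each chunk.

```python
import io

# Shortened sample in the same layout as the question's snippet (illustrative only).
SAMPLE = """\
500
0.14166    0.09077      0
0.11918    0.08461      0
42
0.00733    0.01969      0
0.01809    -0.01422     0
42
0.07937    0.07022      0
"""


def split_by_markers(lines):
    """Group coordinate rows by the marker line that precedes them.

    Returns a list of (marker, rows) pairs, where rows is a list of
    [x, y, z] float lists. A line with exactly one field is treated as
    a marker; blank lines are skipped.
    """
    chunks = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) == 1:
            # A single field is a marker: start a new, empty chunk.
            chunks.append((int(fields[0]), []))
        else:
            if not chunks:
                # Assumption: rows before any marker are stored under marker None.
                chunks.append((None, []))
            chunks[-1][1].append([float(value) for value in fields])
    return chunks


chunks = split_by_markers(io.StringIO(SAMPLE))
for marker, rows in chunks:
    print(marker, len(rows))
```

For a real file, pass an open file object instead of the StringIO, for example with open('test.dat') as f: chunks = split_by_markers(f). Calling np.array(rows) on each chunk then gives arrays like the ones shown above.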