Search code examples
pythonpandascsvstring-formatting

Pandas magic with ugly CSV format


An ancient atomic simulation software is producing really ugly CSV file, which I want to import to pandas dataframe. The format looks like this:

ITEM: TIMESTEP
0
ITEM: NUMBER OF ATOMS
491
ITEM: BOX BOUNDS pp pp pp
0.0000000000000000e+00 2.8000000000000000e+01
0.0000000000000000e+00 2.8000000000000000e+01
0.0000000000000000e+00 2.8000000000000000e+01
ITEM: ATOMS id type xs ys zs
1 1 0 0 0.142857
2 1 0.0714286 0.0714286 0.142857
3 1 0.0714286 0 0.214286
4 1 0 0.0714286 0.214286
...
491 1 2 2.3 0.4
ITEM: TIMESTEP
0
ITEM: NUMBER OF ATOMS
491
ITEM: BOX BOUNDS pp pp pp
0.0000000000000000e+00 2.8000000000000000e+01
0.0000000000000000e+00 2.8000000000000000e+01
0.0000000000000000e+00 2.8000000000000000e+01
ITEM: ATOMS id type xs ys zs
1 1 0 0 0.142857
2 1 0.0714286 0.0714286 0.142857
3 1 0.0714286 0 0.214286
4 1 0 0.0714286 0.214286
...
491 215 0.4 12.4 2.4
...
...
ITEM: TIMESTEP
1002
...

Basically it is a repeating header with the information of the iteration number. To me it seems that the easiest way to turn this into pandas would be df with the variables ['id', 'type', 'xs', 'ys', 'zs'] and add to it a new column "TIMESTEP", so it would be a nice 2D df. Alternatively could be a multi-index array Timestep 1 -> internal_df['id', 'type', 'xs', 'ys', 'zs']

The information lines (1-9) can be deleted.

The end result would ideally look like this:

    Index   a   b           c           d           TIMESTEP
    1       1   0           0           0.142857    0
    2       1   0.0714286   0.0714286   0.142857    0
    3       1   0.0714286   0           0.214286    0
    4       1   0           0.0714286   0.214286    0
    5       1   0.142857    0           0.142857    0
    ...
    474     1   0.78636     0.788005    0.425791    100002

Would you suggest a string-formatting script (example would be appreciated), or maybe Pandas read_csv with a smart set of settings could do it out of the box?

EDITED: Added true INFORMATION bit of the header, which is to be discarded (bit from "Number of Entries" to "Variables" line)


Solution

  • With , here is one way on how you can approach that (to give you just the general logic).

    #pip install pandas
    import pandas as pd 
    import numpy as np
    ​
    df = pd.read_csv(StringIO(s), sep="/", header=None)
    ​
    m1 = df[0].str.contains("TIMESTEP")
    m2 = df[0].str.contains("Information").fillna(False)
    m3 = df[0].str.contains("ITEM|Variables|\+", regex=True).fillna(False)
    ​
    conds, vals = [m1|m1.shift(1).fillna(False), m2|m3], ["DATA", "TO_SKIP"]
    ​
    out = (
            df
              .assign(flag= pd.Series(np.select(conds, vals, None)).bfill().ffill())
              .pivot(columns="flag", values=0)
              .loc[:, "DATA"].dropna()
              .str.split(expand=True)
              .assign(col= lambda x: x[0].shift(-1).where(x[1].str.contains("TIMESTEP")).ffill())
              .set_axis(["Index", "a", "b", "c", "d", "TIMESTEP"], axis=1)
              .dropna(how="any")
              .reset_index(drop=True)
            )
    ​
    

    Output :

    print(out)
    
      Index    a          b          c         d TIMESTEP
    0     1    1          0          0  0.142857        0
    1     2    1  0.0714286  0.0714286  0.142857        0
    2     3    1  0.0714286          0  0.214286        0
    3     4    1          0  0.0714286  0.214286        0
    4   491    1          2        2.3       0.4        0
    5     1    1          0          0  0.142857        0
    6     2    1  0.0714286  0.0714286  0.142857        0
    7     3    1  0.0714286          0  0.214286        0
    8     4    1          0  0.0714286  0.214286        0
    9   491  215        0.4       12.4       2.4        0
    

    Disclaimer: This approach may not be efficient for large files.