An ancient atomic simulation software is producing really ugly CSV file, which I want to import to pandas dataframe. The format looks like this:
ITEM: TIMESTEP
0
ITEM: NUMBER OF ATOMS
491
ITEM: BOX BOUNDS pp pp pp
0.0000000000000000e+00 2.8000000000000000e+01
0.0000000000000000e+00 2.8000000000000000e+01
0.0000000000000000e+00 2.8000000000000000e+01
ITEM: ATOMS id type xs ys zs
1 1 0 0 0.142857
2 1 0.0714286 0.0714286 0.142857
3 1 0.0714286 0 0.214286
4 1 0 0.0714286 0.214286
...
491 1 2 2.3 0.4
ITEM: TIMESTEP
0
ITEM: NUMBER OF ATOMS
491
ITEM: BOX BOUNDS pp pp pp
0.0000000000000000e+00 2.8000000000000000e+01
0.0000000000000000e+00 2.8000000000000000e+01
0.0000000000000000e+00 2.8000000000000000e+01
ITEM: ATOMS id type xs ys zs
1 1 0 0 0.142857
2 1 0.0714286 0.0714286 0.142857
3 1 0.0714286 0 0.214286
4 1 0 0.0714286 0.214286
...
491 215 0.4 12.4 2.4
...
...
ITEM: TIMESTEP
1002
...
Basically it is a repeating header with the information of the iteration number. To me it seems that the easiest way to turn this into pandas would be df with the variables ['id', 'type', 'xs', 'ys', 'zs'] and add to it a new column "TIMESTEP", so it would be a nice 2D df. Alternatively could be a multi-index array Timestep 1 -> internal_df['id', 'type', 'xs', 'ys', 'zs']
The information lines (1-9) can be deleted.
The end result would ideally look like this:
Index a b c d TIMESTEP
1 1 0 0 0.142857 0
2 1 0.0714286 0.0714286 0.142857 0
3 1 0.0714286 0 0.214286 0
4 1 0 0.0714286 0.214286 0
5 1 0.142857 0 0.142857 0
...
474 1 0.78636 0.788005 0.425791 100002
Would you suggest a string-formatting script (example would be appreciated), or maybe Pandas read_csv with a smart set of settings could do it out of the box?
EDITED: Added true INFORMATION bit of the header, which is to be discarded (bit from "Number of Entries" to "Variables" line)
With pandas, here is one way on how you can approach that (to give you just the general logic).
#pip install pandas
import pandas as pd
import numpy as np
df = pd.read_csv(StringIO(s), sep="/", header=None)
m1 = df[0].str.contains("TIMESTEP")
m2 = df[0].str.contains("Information").fillna(False)
m3 = df[0].str.contains("ITEM|Variables|\+", regex=True).fillna(False)
conds, vals = [m1|m1.shift(1).fillna(False), m2|m3], ["DATA", "TO_SKIP"]
out = (
df
.assign(flag= pd.Series(np.select(conds, vals, None)).bfill().ffill())
.pivot(columns="flag", values=0)
.loc[:, "DATA"].dropna()
.str.split(expand=True)
.assign(col= lambda x: x[0].shift(-1).where(x[1].str.contains("TIMESTEP")).ffill())
.set_axis(["Index", "a", "b", "c", "d", "TIMESTEP"], axis=1)
.dropna(how="any")
.reset_index(drop=True)
)
Output :
print(out)
Index a b c d TIMESTEP
0 1 1 0 0 0.142857 0
1 2 1 0.0714286 0.0714286 0.142857 0
2 3 1 0.0714286 0 0.214286 0
3 4 1 0 0.0714286 0.214286 0
4 491 1 2 2.3 0.4 0
5 1 1 0 0 0.142857 0
6 2 1 0.0714286 0.0714286 0.142857 0
7 3 1 0.0714286 0 0.214286 0
8 4 1 0 0.0714286 0.214286 0
9 491 215 0.4 12.4 2.4 0
Disclaimer: This approach may not be efficient for large files.