I am trying to transform some data into a structured format and do a minor transformation. The source is a .csv file that is actually semi-structured that looks like this:
I would like the resulting data from output to look like this, and it is ok if the approach searches for the literal text 'Thing'
I am not partial to implementation but have looked all around petl and pandas with no success. How would you tackle this?
import pandas as pd
thing1 = pd.DataFrame([['A', 'B', 'C'], ['D', 'E', 'F'], ['G', 'H', 'I']])
thing2 = pd.DataFrame([['Z', 'Y', 'X'], ['A', 'V', 'U'], ['M', 'L', 'P']])
thing1.insert(0, 'label', value = 'Thing1')
thing2.insert(0, 'label', value = 'Thing2')
thing1.append(thing2)
Out[17]:
label 0 1 2
0 Thing1 A B C
1 Thing1 D E F
2 Thing1 G H I
0 Thing2 Z Y X
1 Thing2 A V U
2 Thing2 M L P
EDIT TO ADDRESS COMMENT
There may be a more elegant way to do this (note the addition of reset_index
here, which is necessary for subsequent slicing):
In [36]: thing3 = thing1.append(thing2).reset_index(drop = True)
In [37]: thing3
Out[37]:
label 0 1 2
0 Thing1 A B C # <-- slice from first 'A'
1 Thing1 D E F
2 Thing1 G H I
3 Thing2 Z Y X
4 Thing2 A V U # <-- to second 'A'
5 Thing2 M L P
In [38]: mask = thing3[0].between('A', 'A')
In [39]: mask
Out[39]:
0 True
1 False
2 False
3 False
4 True
5 False
Name: 0, dtype: bool
In [40]: thing3[mask[mask].index[0]: mask[mask].index[1]]
Out[40]:
label 0 1 2
0 Thing1 A B C
1 Thing1 D E F
2 Thing1 G H I
3 Thing2 Z Y X
Or, if you'd like to include the last row, just add 1 to the slice:
In [41]: thing3[mask[mask].index[0]: mask[mask].index[1] + 1]
Out[41]:
label 0 1 2
0 Thing1 A B C
1 Thing1 D E F
2 Thing1 G H I
3 Thing2 Z Y X
4 Thing2 A V U