I am new to Python and I suspect this question might have been asked before, but I really could not find any relevant threads. Hence, the question.
I intend to create a data frame containing separate columns for "group labels" and "individual units," but the original raw data were structured like this:
import pandas as pd
df = pd.read_table('df.txt', names = 'data')
which reads
data
0 group1
1 unit_a
2 unit_b
3 group2
4 unit_c
5 unit_d
6 unit_e
7 group3
8 unit_f
9 unit_g
...
but I want to create a data frame like this
Group Unit
0 group1 unit_a
1 group1 unit_b
2 group2 unit_c
3 group2 unit_d
4 group2 unit_e
5 group3 unit_f
6 group3 unit_g
...
That is, in the original data, units are listed under their respective group labels and they are all listed in the same column, while each group may contain different number of units. So I have a hard time slicing data using intervals or other methods. How can I sort or slice the original data into my desired two-column data frame-type output?
Thanks in advance.
A very rudimentary approach would be to just assign a new column with the masked values using str.contains
, filter the rows where the columns matched (e.g. group1, group1
), and then reset the index.
df = (df.assign(group=df[df.data.str.contains('group')])
.ffill()
.query('data != group')
.reset_index(drop=True)
.rename(columns={'data':'unit'}))
I can't help but feel there is an easier way to go about this though.
Demo
>>>> (df.assign(group=df[df.data.str.contains('group')])
.ffill()
.query('data != group')
.reset_index(drop=True))
.rename(columns={'data':'unit'}))
unit group
0 unit_a group1
1 unit_b group1
2 unit_c group2
3 unit_d group2
4 unit_e group2
5 unit_f group3
6 unit_g group3