Search code examples
pythonpandasindexinggroupingmulti-level

Python: creating separate columns for group labels and individual units when both are in the same column of the original data


I am new to Python and I suspect this question might have been asked before, but I really could not find any relevant threads. Hence, the question.

I intend to create a data frame containing separate columns for "group labels" and "individual units," but the original raw data were structured like this:

import pandas as pd
df = pd.read_table('df.txt',  names = 'data')

which reads

        data
0 group1
1 unit_a
2 unit_b
3 group2
4 unit_c
5 unit_d
6 unit_e
7 group3
8 unit_f
9 unit_g
...

but I want to create a data frame like this

       Group   Unit
    0 group1 unit_a
    1 group1 unit_b
    2 group2 unit_c
    3 group2 unit_d
    4 group2 unit_e
    5 group3 unit_f
    6 group3 unit_g
    ...

That is, in the original data, units are listed under their respective group labels and they are all listed in the same column, while each group may contain different number of units. So I have a hard time slicing data using intervals or other methods. How can I sort or slice the original data into my desired two-column data frame-type output?

Thanks in advance.


Solution

  • A very rudimentary approach would be to just assign a new column with the masked values using str.contains, filter the rows where the columns matched (e.g. group1, group1), and then reset the index.

    df = (df.assign(group=df[df.data.str.contains('group')])
            .ffill()
            .query('data != group')
            .reset_index(drop=True)
            .rename(columns={'data':'unit'}))
    

    I can't help but feel there is an easier way to go about this though.

    Demo

    >>>> (df.assign(group=df[df.data.str.contains('group')])
            .ffill()
            .query('data != group')
            .reset_index(drop=True))  
            .rename(columns={'data':'unit'}))
    
         unit   group
    0  unit_a  group1
    1  unit_b  group1
    2  unit_c  group2
    3  unit_d  group2
    4  unit_e  group2
    5  unit_f  group3
    6  unit_g  group3