Search code examples
pythonpandasdataframeloopstext

loop multiple row data to a single row in dataframe


I am extracting data points from a multi line text file and and trying to add the group data to a single row in the dataframe, but I am getting each data point on its own row I would like to flatten this to 2 rows group1 and group2. I am new to python. Also if there is a more efficient way to do this that would be great too. I have tried groupby() but this did not seem to work? Thanks in advance.

import pandas as pd

data = """
Jan 2024
Group1 02/02/2024
dog 10 20
cat 21 32
Group2 05/02/2024
dog 23 45
cat 45 65
owl 24 12
monthly
Admin 02 22
clean 05 32
"""

extract = []
dog, cat, owl = [], [], []
for line in data.splitlines():
    a = c = e = ''
    # print(line)
    if 'Group' in line:
        group = line.rsplit()[0]
    
    if 'dog' in line or 'cat' in line or 'owl' in line:
        if line.startswith("dog"):
            dog, a, b = line.split()
        elif line.startswith("cat"):
            cat, c, d = line.split()
        elif line.startswith("owl"):
            owl, e, f = line.split()
        
        extract.append({
            'group': group,
            'dog': a,
            'cat': c,
            'owl': e
        })

df = pd.DataFrame(extract)
df = df[['group', 'dog', 'cat', 'owl']]
print(df)

Currently I am getting the following:

    group dog cat owl
0  Group1  10
1  Group1      21
2  Group2  23
3  Group2      45
4  Group2          24

What I would like is:

   group dog cat owl
0  Group1  10 21
1  Group2  23 45  24 

Solution

  • You can collapse rows before constructing a dataframe. This can be achieved by maintaining one dict column_name->value per group, refreshing it when a new group starts and adding it as a row right before that. Don't forget to add a row in the very end.

    extract = []
    row = None
    
    for line in data.splitlines():
        if 'Group' in line:
            if row is not None: # we have something to add
                extract.append(row)
            group = line.rsplit()[0]
            row = {'group': group} # new group starts - refreshing our dict
        
        if 'dog' in line or 'cat' in line or 'owl' in line:
            animal, val1, val2 = line.split()
            row[animal] = val1
            
    if row is not None: # a final group
        extract.append(row)
    
    df = pd.DataFrame(extract)
    df = df[['group', 'dog', 'cat', 'owl']]
    print(df)
    

    Output:

        group dog cat  owl
    0  Group1  10  21  NaN
    1  Group2  23  45   24