Search code examples
pythonpandasdata-structurespetl

Using Python with semi-structured data, how to add a column value based on text encountered in preceding row


I am trying to transform some data into a structured format and do a minor transformation. The source is a .csv file that is actually semi-structured that looks like this:

enter image description here

I would like the resulting data from output to look like this, and it is ok if the approach searches for the literal text 'Thing'

enter image description here

I am not partial to implementation but have looked all around petl and pandas with no success. How would you tackle this?


Solution

  • import pandas as pd  
      
    thing1 = pd.DataFrame([['A', 'B', 'C'], ['D', 'E', 'F'], ['G', 'H', 'I']])
    thing2 = pd.DataFrame([['Z', 'Y', 'X'], ['A', 'V', 'U'], ['M', 'L', 'P']])
    
    thing1.insert(0, 'label', value = 'Thing1')
    thing2.insert(0, 'label', value = 'Thing2')
    
    thing1.append(thing2)
    
    Out[17]: 
        label  0  1  2
    0  Thing1  A  B  C
    1  Thing1  D  E  F
    2  Thing1  G  H  I
    0  Thing2  Z  Y  X
    1  Thing2  A  V  U
    2  Thing2  M  L  P
    

    EDIT TO ADDRESS COMMENT

    There may be a more elegant way to do this (note the addition of reset_index here, which is necessary for subsequent slicing):

    In [36]: thing3 = thing1.append(thing2).reset_index(drop = True)
    
    In [37]: thing3
    Out[37]: 
        label  0  1  2
    0  Thing1  A  B  C # <-- slice from first 'A'
    1  Thing1  D  E  F
    2  Thing1  G  H  I
    3  Thing2  Z  Y  X
    4  Thing2  A  V  U # <-- to second 'A'
    5  Thing2  M  L  P
    
    In [38]: mask = thing3[0].between('A', 'A')
    
    In [39]: mask
    Out[39]: 
    0     True
    1    False
    2    False
    3    False
    4     True
    5    False
    Name: 0, dtype: bool
    
    In [40]: thing3[mask[mask].index[0]: mask[mask].index[1]]
    Out[40]: 
        label  0  1  2
    0  Thing1  A  B  C
    1  Thing1  D  E  F
    2  Thing1  G  H  I
    3  Thing2  Z  Y  X
    

    Or, if you'd like to include the last row, just add 1 to the slice:

    In [41]: thing3[mask[mask].index[0]: mask[mask].index[1] + 1]
    Out[41]: 
        label  0  1  2
    0  Thing1  A  B  C
    1  Thing1  D  E  F
    2  Thing1  G  H  I
    3  Thing2  Z  Y  X
    4  Thing2  A  V  U