
Insert elements in front of specific list elements


I have a pandas DataFrame with two columns:

  • sentence - fo n bar
  • annotations [B-inv, B-inv, O, I-acc, O, B-com, I-com, I-com]

I want to insert an additional 'O' element into the annotations list in front of each annotation that starts with 'B', which gives:

[O, B-inv, O, B-inv, O, I-acc, O, O, B-com, I-com, I-com]

Then I want to insert an additional whitespace into the sentence at the same positions, that is, in front of each character whose index is the index of a 'B' annotation in the initial list (here [0, 1, 5]), which gives:

' f o n  bar'

To make it clearer, here are both states as tables:

  • Initial sentence:
Ind  Sentence char  Annot
0    f              B-inv
1    o              B-inv
2    whitespace     O
3    n              I-acc
4    whitespace     O
5    b              B-com
6    a              I-com
7    r              I-com
  • End sentence:
Ind  Sentence char  Annot
0    whitespace     O
1    f              B-inv
2    whitespace     O
3    o              B-inv
4    whitespace     O
5    n              I-acc
6    whitespace     O
7    whitespace     O
8    b              B-com
9    a              I-com
10   r              I-com

Solution

  • Updated answer (list comprehension)

    from itertools import chain
    annot = ['B-inv', 'B-inv', 'O', 'I-acc', 'O', 'B-com', 'I-com', 'I-com']
    sent = list('fo n bar')
    
    annot, sent = list(map(lambda l: list(chain(*l)), list(zip(*[(['O', a], [' ', s]) if a.startswith('B') else ([a], [s]) for a,s in zip(annot, sent)]))))
    
    print(annot)
    print(''.join(sent))
    

    chain from itertools lets you chain a list of lists together into a single flat list. The rest is a somewhat clumsy use of zip with argument unpacking (the * prefix on an argument) to fit everything on one line. map is only there to apply the same flattening to both lists.
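
    To see what zip(*...) and chain(*...) each contribute, here is a minimal sketch on the first three elements only. The names pairs, annot_chunks and sent_chunks are made up for illustration and do not appear in the one-liner above.

```python
from itertools import chain

# Per-element pairs, as the comprehension builds them for 'f', 'o' and ' '
pairs = [
    (['O', 'B-inv'], [' ', 'f']),  # 'B' annotation: filler is prepended
    (['O', 'B-inv'], [' ', 'o']),  # 'B' annotation: filler is prepended
    (['O'], [' ']),                # other annotation: kept as is
]

# zip(*pairs) transposes the pairs: one tuple of annotation chunks,
# one tuple of character chunks
annot_chunks, sent_chunks = zip(*pairs)

# chain(*chunks) flattens each tuple of small lists into one sequence
flat_annot = list(chain(*annot_chunks))
flat_sent = ''.join(chain(*sent_chunks))

print(flat_annot)
print(repr(flat_sent))
```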

    A more readable version, which also makes the steps easier to follow, could be:

    # find where in the annotations the element starts with 'B'
    loc = [a.startswith('B') for a in annot]
    # use this locator to prepend the filler element, then merge the list of lists with `chain`
    annot = list(chain.from_iterable([['O', a] if l else [a] for a,l in zip(annot, loc)]))
    sent = ''.join(chain.from_iterable([[' ', a] if l else [a] for a,l in zip(sent, loc)])) # same on sentence
    

    Note that this version does not use map, since each list is processed separately, and it needs less zipping and casting to lists. It is most probably the cleaner, and hence preferred, solution.
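
    If the DataFrame holds one sentence per row, with the annotations stored as a Python list in each cell (an assumption based on the question), the readable version could be wrapped in a function and applied row by row. This is only a sketch: the helper name pad_before_b and the column names sentence and annotations are hypothetical.

```python
from itertools import chain

import pandas as pd


def pad_before_b(sentence, annotations):
    """Hypothetical helper: put ' ' / 'O' before every position annotated 'B...'."""
    loc = [a.startswith('B') for a in annotations]
    new_annot = list(chain.from_iterable(
        ['O', a] if is_b else [a] for a, is_b in zip(annotations, loc)))
    new_sent = ''.join(chain.from_iterable(
        [' ', c] if is_b else [c] for c, is_b in zip(sentence, loc)))
    return new_sent, new_annot


# Assumed layout: one row per sentence, annotations kept as a list per cell
df = pd.DataFrame({
    'sentence': ['fo n bar'],
    'annotations': [['B-inv', 'B-inv', 'O', 'I-acc', 'O', 'B-com', 'I-com', 'I-com']],
})

# Process every row, then write both columns back
padded = [pad_before_b(s, a) for s, a in zip(df['sentence'], df['annotations'])]
df['sentence'] = [s for s, _ in padded]
df['annotations'] = [a for _, a in padded]

print(df)
```

    A plain loop over the zipped columns keeps the per-row logic identical to the list version above.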


  • Old answer (pandas)

    I am not sure a DataFrame is the most convenient place to do this. It might be easier on a simple list, before converting to a DataFrame.

    But anyway, here is a way through it, assuming you don't really have meaningful indices in your DataFrame (so that indices are simply the integer count of each row).

    The trick is to use the .str string methods (startswith in this case) to find the matching strings in the column of interest. You can then loop over the matching indices ([0, 1, 5] in the example) and, for each one, add a row holding the whitespace and 'O' at a made-up half index just below it (e.g. 0.5 to place the row before row 1). Sorting by index with .sort_index() then puts every row where you want it.

    import numpy as np
    import pandas as pd
    annot = ['B-inv', 'B-inv', 'O', 'I-acc', 'O', 'B-com', 'I-com', 'I-com']
    sent = list('fo n bar')
    df = pd.DataFrame({'sent':sent, 'annot':annot})
    
    idx = np.argwhere(df.annot.str.startswith('B').values) # find rows where annotations start with 'B'
    
    for i in idx.ravel(): # Loop over the indices before which we want to insert a new row
      df.loc[i-0.5] = [' ', 'O'] # made up indices so that the subsequent sorting will place the row where you want it
    
    df = df.sort_index().reset_index(drop=True) # move the new rows into place and renumber the index
    print(df)
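
    The same half-index idea can probably be written without the explicit loop, by building all filler rows at once and concatenating them. This is a variant sketch, not the code from the answer above; the names is_b, filler and out are made up for illustration.

```python
import pandas as pd

annot = ['B-inv', 'B-inv', 'O', 'I-acc', 'O', 'B-com', 'I-com', 'I-com']
sent = list('fo n bar')
df = pd.DataFrame({'sent': sent, 'annot': annot})

# Boolean mask of the rows whose annotation starts with 'B'
is_b = df['annot'].str.startswith('B').to_numpy()

# One filler row per match, indexed half a step before the matching row
filler = pd.DataFrame({'sent': ' ', 'annot': 'O'}, index=df.index[is_b] - 0.5)

# Stack both frames, sort by index so the fillers fall into place, then renumber
out = pd.concat([df, filler]).sort_index().reset_index(drop=True)

print(out)
```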