I have a pandas DataFrame with two columns, a sentence and its list of per-character annotations:
fo n bar
[B-inv, B-inv, O, I-acc, O, B-com, I-com, I-com]
I want to insert additional 'O' elements in the annotations list in front of each annotation starting with 'B', which will look like this:
[O, B-inv, O, B-inv, O, I-acc, O, O, B-com, I-com, I-com]
Then I want to insert an additional whitespace into the sentence in front of each character whose index equals the index of a 'B' annotation in the initial annotation list, i.e. in front of the characters at indices [0, 1, 5], which gives:
' f o n  bar'
Maybe it is clearer to show it as tables. The initial data:
Ind | Sentence char | Annot |
---|---|---|
0 | f | B-inv |
1 | o | B-inv |
2 | whitespace | O |
3 | n | I-acc |
4 | whitespace | O |
5 | b | B-com |
6 | a | I-com |
7 | r | I-com |

And the desired result:

Ind | Sentence char | Annot |
---|---|---|
0 | whitespace | O |
1 | f | B-inv |
2 | whitespace | O |
3 | o | B-inv |
4 | whitespace | O |
5 | n | I-acc |
6 | whitespace | O |
7 | whitespace | O |
8 | b | B-com |
9 | a | I-com |
10 | r | I-com |
from itertools import chain
annot = ['B-inv', 'B-inv', 'O', 'I-acc', 'O', 'B-com', 'I-com', 'I-com']
sent = list('fo n bar')
annot, sent = list(map(lambda l: list(chain(*l)), list(zip(*[(['O', a], [' ', s]) if a.startswith('B') else ([a], [s]) for a,s in zip(annot, sent)]))))
print(annot)
print(''.join(sent))
`chain` from `itertools` allows you to chain together a list of lists to form a single list. The rest is a somewhat clumsy use of `zip` together with list unpacking (the `*` prefix on an argument) to get everything onto one line. `map` is only used to apply the same operation to both lists.
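To see the two building blocks in isolation, here is a small sketch on toy data (the `nested` and `pairs` values are made up for illustration and are not part of the original data):

```python
from itertools import chain

# chain.from_iterable flattens exactly one level of nesting
nested = [['O', 'B-inv'], ['O'], ['I-acc']]
flat = list(chain.from_iterable(nested))
print(flat)  # ['O', 'B-inv', 'O', 'I-acc']

# zip(*pairs) "unzips" a list of pairs into two tuples,
# one holding all first items and one holding all second items
pairs = [(['O', 'B-inv'], [' ', 'f']), (['O'], [' '])]
annot_parts, sent_parts = zip(*pairs)
print(annot_parts)  # (['O', 'B-inv'], ['O'])
print(sent_parts)   # ([' ', 'f'], [' '])
```

The one-liner does both at once: it unzips the pairs and then flattens each side.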
A more readable version, in which the steps are easier to follow, could be:
# find where in the annotations the element starts with 'B'
loc = [a.startswith('B') for a in annot]
# use this mask to prepend 'O' where needed, then flatten the list of lists with `chain`
annot = list(chain.from_iterable([['O', a] if is_b else [a] for a, is_b in zip(annot, loc)]))
# same on the sentence, prepending a whitespace instead
sent = ''.join(chain.from_iterable([[' ', s] if is_b else [s] for s, is_b in zip(sent, loc)]))
Note that above, I do not use `map`, as each list is processed separately, and there is less zipping and casting to lists. So this is most probably a much cleaner, and hence preferred, solution.
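If it helps to follow the steps, here is a short sketch that prints the intermediate values for the example data. The names `is_b`, `nested` and `positions` are only there for illustration:

```python
annot = ['B-inv', 'B-inv', 'O', 'I-acc', 'O', 'B-com', 'I-com', 'I-com']

# boolean mask: True wherever the annotation starts with 'B'
loc = [a.startswith('B') for a in annot]
print(loc)  # [True, True, False, False, False, True, False, False]

# positions of the True entries, i.e. the indices listed in the question
positions = [i for i, is_b in enumerate(loc) if is_b]
print(positions)  # [0, 1, 5]

# list of lists before flattening: a 'B' annotation gets an 'O' in front of it
nested = [['O', a] if is_b else [a] for a, is_b in zip(annot, loc)]
print(nested[:3])  # [['O', 'B-inv'], ['O', 'B-inv'], ['O']]
```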
I am not sure it is the most convenient to do this on a DataFrame. It might be easier on a simple list, before converting to a DataFrame.
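As a rough sketch of that list-first route (the column names `sent` and `annot` are my own choice, and one row per character is assumed):

```python
import pandas as pd

annot = ['B-inv', 'B-inv', 'O', 'I-acc', 'O', 'B-com', 'I-com', 'I-com']
sent = 'fo n bar'

# build the rows in plain Python first, adding a (' ', 'O') row before every 'B' annotation
rows = []
for s, a in zip(sent, annot):
    if a.startswith('B'):
        rows.append((' ', 'O'))
    rows.append((s, a))

# only now convert to a DataFrame
df = pd.DataFrame(rows, columns=['sent', 'annot'])
print(df)
```

This way no index juggling is needed, because the rows are already in the right order when the DataFrame is created.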
But anyway, here is a way through it, assuming you don't really have meaningful indices in your DataFrame (so that indices are simply the integer count of each row).
The trick is to use the `.str` string functions, such as `startswith` in this case, to find matching strings in the column of interest. You can then loop over the matching indices (`[0, 1, 5]` in the example) and insert the row with the whitespace and `'O'` data at a dummy location (a half index, e.g. `0.5` to place the row before row `1`). Sorting by index with `.sort_index()` then rearranges all rows in the way you want.
import numpy as np
import pandas as pd
annot = ['B-inv', 'B-inv', 'O', 'I-acc', 'O', 'B-com', 'I-com', 'I-com']
sent = list('fo n bar')
df = pd.DataFrame({'sent': sent, 'annot': annot})
idx = np.argwhere(df.annot.str.startswith('B').values)  # find rows where annotations start with 'B'
for i in idx.ravel():  # loop over the indices before which we want to insert a new row
    df.loc[i - 0.5] = [' ', 'O']  # made-up half index so that the subsequent sorting places the row where you want it
df = df.sort_index().reset_index(drop=True)  # sort the rows into place and restore a clean integer index
print(df)
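If each row of your DataFrame instead holds a whole sentence and its list of annotations, which is how I read the question, a sketch along these lines might be simpler than the half-index trick. The helper `pad_before_b`, the column names and the one-row example frame are assumptions for illustration:

```python
from itertools import chain

import pandas as pd


def pad_before_b(sent, annot):
    """Hypothetical helper: prepend ' ' / 'O' wherever the annotation starts with 'B'."""
    loc = [a.startswith('B') for a in annot]
    new_sent = ''.join(chain.from_iterable([' ', s] if is_b else [s] for s, is_b in zip(sent, loc)))
    new_annot = list(chain.from_iterable(['O', a] if is_b else [a] for a, is_b in zip(annot, loc)))
    return new_sent, new_annot


# assumed layout: one sentence and one annotation list per row
df_rows = pd.DataFrame({
    'sent': ['fo n bar'],
    'annot': [['B-inv', 'B-inv', 'O', 'I-acc', 'O', 'B-com', 'I-com', 'I-com']],
})

# process every row with the helper and build a new DataFrame from the results
results = [pad_before_b(s, a) for s, a in zip(df_rows['sent'], df_rows['annot'])]
padded = pd.DataFrame({
    'sent': [r[0] for r in results],
    'annot': [r[1] for r in results],
})
print(repr(padded.loc[0, 'sent']))
print(padded.loc[0, 'annot'])
```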