Given this dataframe
and word_index
dictionary:
import pandas as pd
df = pd.DataFrame(data={'text_ids': [
[1, 2, 3, 2, 7, 2, 8, 2, 0],
[1, 2, 4, 2, 7, 2, 8, 2, 0],
[1, 2, 5, 2, 6, 2, 8, 2, 0],
[1, 2, 9, 2, 6, 2, 10, 2, 11, 2, 8, 0]
]})
word_index = {0: '<eos>', 1: '<sos>', 2: '/s', 3: 'he', 4: 'she', 5:'they', 6:'love', 7:'loves', 8: 'cats', 9: 'we', 10: 'talking', 11: 'about', 12: '<pad>'}
How can I map each sequence in text_ids
to its corresponding value(s) in word_index
, while making sure that \s
really creates spaces in each string? Also, I need to add <pad>
tokens to each string that has a length smaller than the largest integer sequence.
Expected output:
text_ids text
0 [1, 2, 3, 2, 7, 2, 8, 2, 0] <sos> he loves cats <eos><pad><pad><pad>
1 [1, 2, 4, 2, 7, 2, 8, 2, 0] <sos> she loves cats <eos><pad><pad><pad>
2 [1, 2, 5, 2, 6, 2, 8, 2, 0] <sos> they love cats <eos><pad><pad><pad>
3 [1, 2, 9, 2, 6, 2, 10, 2, 11, 2, 8, 0] <sos> we love talking about cats <eos>
You could use map
to assign the values from your dictionary. Ensure to first replace '\s'
with ' '
.
Then reshape your dataframe to wide format with pivot
to ensure the same number of items and fillna
the missing spots with "<pad>"
.
Finally aggregate to a string with apply
and join
to the original dataframe:
word_index[2] = ' '
df2 = df['text_ids'].explode().map(word_index).reset_index()
df.join(
df2.assign(col=df2.groupby('index').cumcount())
.pivot('col', 'index', 'text_ids')
.fillna('<pad>')
.apply(''.join)
.rename('text')
)
output:
text_ids text
0 [1, 2, 3, 2, 7, 2, 8, 2, 0] <sos> he loves cats <eos><pad><pad><pad>
1 [1, 2, 4, 2, 7, 2, 8, 2, 0] <sos> she loves cats <eos><pad><pad><pad>
2 [1, 2, 5, 2, 6, 2, 8, 2, 0] <sos> they love cats <eos><pad><pad><pad>
3 [1, 2, 9, 2, 6, 2, 10, 2, 11, 2, 8, 0] <sos> we love talking about cats<eos>
Another option using apply
:
word_index[2] = ' '
# padding values
l = df['text_ids'].str.len()
pad = (l.max()-l).mul(pd.Series(['<pre>']*len(l)))
df['text'] = df['text_ids'].apply(lambda s: ''.join(word_index[e] for e in s))+pad