python pandas list dataframe nlp

Map column lists to dictionary and create new column with padded strings


Given this dataframe and word_index dictionary:

import pandas as pd

df = pd.DataFrame(data={'text_ids': [
                                     [1, 2, 3, 2, 7, 2, 8, 2, 0],
                                     [1, 2, 4, 2, 7, 2, 8, 2, 0],
                                     [1, 2, 5, 2, 6, 2, 8, 2, 0],
                                     [1, 2, 9, 2, 6, 2, 10, 2, 11, 2, 8, 0]
                                    ]})

word_index = {0: '<eos>', 1: '<sos>', 2: '/s', 3: 'he', 4: 'she', 5:'they', 6:'love', 7:'loves', 8: 'cats', 9: 'we', 10: 'talking', 11: 'about', 12: '<pad>'}

How can I map each sequence in text_ids to its corresponding values in word_index, making sure that each /s token becomes an actual space in the resulting string? I also need to append <pad> tokens to every string whose id sequence is shorter than the longest sequence.

Expected output:

                                 text_ids                                       text
0             [1, 2, 3, 2, 7, 2, 8, 2, 0]   <sos> he loves cats <eos><pad><pad><pad>
1             [1, 2, 4, 2, 7, 2, 8, 2, 0]  <sos> she loves cats <eos><pad><pad><pad>
2             [1, 2, 5, 2, 6, 2, 8, 2, 0]  <sos> they love cats <eos><pad><pad><pad>
3  [1, 2, 9, 2, 6, 2, 10, 2, 11, 2, 8, 0]     <sos> we love talking about cats <eos>

Solution

  • You could explode the lists and use map to look up each id in your dictionary. First replace the '/s' entry with ' '.

    Then reshape the result to wide format with pivot so that every row has the same number of items, and fillna the missing spots with "<pad>".

    Finally, concatenate each column into a single string with apply and join the result to the original dataframe:

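    # make the '/s' separator token render as a real space
    word_index[2] = ' '
    
    # one row per token id, mapped to its word; reset_index keeps the original row label in 'index'
    df2 = df['text_ids'].explode().map(word_index).reset_index()
    
    df.join(
     df2.assign(col=df2.groupby('index').cumcount())  # position of each token within its row
        .pivot(index='col', columns='index', values='text_ids')  # keyword arguments are required in pandas >= 2.0
        .fillna('<pad>')   # shorter rows get '<pad>' in the missing positions
        .apply(''.join)    # concatenate each column (one column per original row)
        .rename('text')
    )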
    

    output:

                                     text_ids                                       text
    0             [1, 2, 3, 2, 7, 2, 8, 2, 0]   <sos> he loves cats <eos><pad><pad><pad>
    1             [1, 2, 4, 2, 7, 2, 8, 2, 0]  <sos> she loves cats <eos><pad><pad><pad>
    2             [1, 2, 5, 2, 6, 2, 8, 2, 0]  <sos> they love cats <eos><pad><pad><pad>
    3  [1, 2, 9, 2, 6, 2, 10, 2, 11, 2, 8, 0]      <sos> we love talking about cats<eos>
    

    Another option using apply:

    word_index[2] = ' '
    
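    # padding string for each row: one '<pad>' per missing position
    l = df['text_ids'].str.len()
    pad = (l.max() - l).map(lambda n: '<pad>' * int(n))
    
    # map every id to its token, join the tokens, then append the padding
    df['text'] = df['text_ids'].apply(lambda s: ''.join(word_index[e] for e in s)) + pad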
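
For comparison, the same result can also be sketched by padding each id list with the id of '<pad>' before joining, so the padding goes through the same lookup as every other token. This is only a minimal sketch: ids_to_text, PAD_ID, tokens and width are names made up for the illustration, and it assumes every id in text_ids is a key of word_index.

```python
import pandas as pd

df = pd.DataFrame(data={'text_ids': [
    [1, 2, 3, 2, 7, 2, 8, 2, 0],
    [1, 2, 4, 2, 7, 2, 8, 2, 0],
    [1, 2, 5, 2, 6, 2, 8, 2, 0],
    [1, 2, 9, 2, 6, 2, 10, 2, 11, 2, 8, 0],
]})

word_index = {0: '<eos>', 1: '<sos>', 2: '/s', 3: 'he', 4: 'she', 5: 'they',
              6: 'love', 7: 'loves', 8: 'cats', 9: 'we', 10: 'talking',
              11: 'about', 12: '<pad>'}

# Copy of the lookup table in which the '/s' token renders as a space
# (hypothetical name; the original word_index is left untouched).
tokens = {**word_index, 2: ' '}

PAD_ID = 12  # id of '<pad>' in word_index

# Length of the longest id sequence, computed in plain Python.
width = max(len(ids) for ids in df['text_ids'])


def ids_to_text(ids, width):
    """Pad ids with PAD_ID up to width, then join the mapped tokens."""
    padded = list(ids) + [PAD_ID] * (width - len(ids))
    return ''.join(tokens[i] for i in padded)


df['text'] = df['text_ids'].apply(lambda ids: ids_to_text(ids, width))
print(df['text'].tolist())
```

Padding at the id level avoids hard-coding the '<pad>' string twice, at the cost of a Python-level loop per row, which should be fine for small frames like this one.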