python dataframe tokenize

How do I turn a column of lists into strings?


  Speaker ID                                         Utterances
0         S1  [alright Sue now it's like uh i dropped like C...
1         S2  [this year? this term?, ri- oh but you dropped...
2         S3  [yeah. hi, hi, yeah i already signed [S2: okay...
3         S4  [back in i was like w- what is that?, yeah and...
4         S5  [okay well i'm not here for a drop-add class [...
5         S6  [me, yeah. that's right, i have a question lik...
6         S7  [hello, hi, what was your name?, i thought i o...

Actually, the end goal is to create a new column where everything under the 'Utterances' column has the punctuation removed and has been tokenized. I just need to turn the list of strings into a string first, right?

P.S. I know the formatting is weird, but I don't know how to fix that and I haven't found an answer anywhere yet. If anyone could tell me how I'm supposed to include the text I'm working with so it doesn't look weird, that would be great. Thanks!


Solution

  • One idea: join each list into a single string with str.join, then remove the punctuation with a regular expression:

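    import re
    from string import punctuation

    import pandas as pd

    # Sample data: each cell holds a list of strings
    df = pd.DataFrame({'Utterances': [["me, yeah. that's right, i have a question lik"], ["hello, hi, what was your name?, i thought i o"]]})

    # Join each list into a single space-separated string
    df['Utterances'] = df['Utterances'].str.join(' ')
    # Build an alternation that matches any single punctuation character
    pattern = '|'.join(re.escape(e) for e in punctuation)
    # regex=True is required on pandas 2.0+, where str.replace matches literally by default
    df['Utterances'] = df['Utterances'].str.replace(pattern, '', regex=True)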
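
The question also asks for tokenized output, which the snippet above stops short of. Here is a minimal sketch of one way to do all three steps (join, strip punctuation, tokenize) in a single helper. It assumes that splitting on whitespace is an acceptable tokenizer and that only ASCII punctuation needs removing; `clean_and_tokenize`, `PUNCT_RE`, and the sample list are names made up for illustration, not part of the original answer.

```python
import re
from string import punctuation

# Character class that matches any single ASCII punctuation character.
PUNCT_RE = re.compile('[' + re.escape(punctuation) + ']')


def clean_and_tokenize(utterances):
    """Join a list of strings, remove punctuation, and split on whitespace."""
    text = ' '.join(utterances)      # list of strings -> one string
    text = PUNCT_RE.sub('', text)    # drop punctuation characters
    return text.split()              # naive whitespace tokenization


# Hypothetical sample shaped like one cell of the 'Utterances' column.
sample = ["me, yeah. that's right", "i have a question"]
tokens = clean_and_tokenize(sample)
print(tokens)
```

On the DataFrame from the question, this could be applied row by row with `df['Tokens'] = df['Utterances'].apply(clean_and_tokenize)`, where `Tokens` is a hypothetical column name. Note that removing apostrophes turns "that's" into "thats"; if contractions matter, a dedicated tokenizer may be a better fit than whitespace splitting.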