   Speaker ID  Utterances
0  S1          [alright Sue now it's like uh i dropped like C...
1  S2          [this year? this term?, ri- oh but you dropped...
2  S3          [yeah. hi, hi, yeah i already signed [S2: okay...
3  S4          [back in i was like w- what is that?, yeah and...
4  S5          [okay well i'm not here for a drop-add class [...
5  S6          [me, yeah. that's right, i have a question lik...
6  S7          [hello, hi, what was your name?, i thought i o...
Actually, the end goal is to create a new column where everything under the 'Utterances' column has the punctuation removed and has been tokenized. I just need to turn the list of strings into a string first, right?
P.S. I know the formatting is weird, but I don't know how to fix that and I haven't found an answer anywhere yet. If anyone could tell me how I'm supposed to include the text I'm working with so it doesn't look weird, that would be great. Thanks!
One idea is to join each list into a single string first, then strip the punctuation with a regular expression:
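import re
from string import punctuation

import pandas as pd

df = pd.DataFrame({'Utterances':[["me, yeah. that's right, i have a question lik"], ["hello, hi, what was your name?, i thought i o"]]})
# Join each list of strings into a single string.
df['Utterances'] = df['Utterances'].str.join(' ')
# Build an alternation that matches any single punctuation character.
pattern = '|'.join(re.escape(e) for e in punctuation)
# regex=True is required: pandas 2.0+ treats the pattern as a literal string by default.
df['Utterances'] = df['Utterances'].str.replace(pattern, '', regex=True)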
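Since the end goal is a new column that is both punctuation-free and tokenized, the same steps can be chained and followed by a split. This is only a sketch: the column name `Tokens` is made up for illustration, and plain whitespace splitting is assumed as the tokenizer (a library tokenizer such as NLTK's may suit you better). It writes to a new column so the original lists stay untouched.

```python
import re
from string import punctuation

import pandas as pd

# Sample data copied from the snippet above: each cell holds a list of strings.
df = pd.DataFrame({
    'Utterances': [
        ["me, yeah. that's right, i have a question lik"],
        ["hello, hi, what was your name?, i thought i o"],
    ]
})

# Character class matching any single ASCII punctuation character.
pattern = '[' + re.escape(punctuation) + ']'

# 'Tokens' is a hypothetical column name; whitespace splitting is an
# assumed stand-in for whatever tokenizer you actually want.
df['Tokens'] = (
    df['Utterances']
    .str.join(' ')                         # list of strings -> one string
    .str.replace(pattern, '', regex=True)  # drop punctuation
    .str.split()                           # string -> list of tokens
)

print(df['Tokens'].tolist())
```

Note that stripping all punctuation also removes apostrophes, so "that's" becomes "thats"; if that matters, take `'` out of the character class.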