Tags: python, nlp, spacy, bert-language-model, python-re

Removing commas after processing lists of strings, when ' '.join(x) does not work


So I fed a dataframe of sentences into BERT for token prediction, and as output I received, along with the predictions, the sentences split into words. Now I want to revert my dataframe of the split/tokenized sentences and predictions back to the original sentences. (Of course I have the original sentence, but I need to do this so that the predictions stay in harmony with the sentence tokens.)

original sentence
You couldn't have done any better because if you could have, you would have.

Post processing
['[CLS]', 'You', 'couldn', "'", 't', 'have', 'done', 'any', 'better', 'because', 'if', 'you', 'could', 'have', ',', 'you', 'would', 'have', '.', '[SEP]']

I identified three necessary steps: 1. remove the quote marks, 2. remove the [CLS] and [SEP] markers along with their extra quote marks and commas, 3. remove the commas separating the words and merge the words back together.

def fix_df(row):
    sentences = row['t_words'] 
    return remove_edges(sentences)

def remove_edges(sentences):
    x = sentences[9:-9]
    return remove_qmarks(x)

def remove_qmarks(x):
    y = x.replace("'", "")
    return join(y)

def join(y):
    z = ' '.join(y)
    return z


a_df['sents'] = a_df.apply(fix_df, axis=1) 

The first two functions largely worked correctly, but the last one did not. Instead, I got a result that looked like this:

Y o u , c o u l d n , " " , t , h a v e, d o n e ,...

The commas didn't go away, and the text got distorted instead. I am definitely missing something. What could that be?


Solution

  • The result string really, really looks like a string representation of an otherwise perfectly normal list, so let's have Python convert it back to a list, safely, per Convert string representation of list to list:

    import ast
    
    result = """['[CLS]', 'You', 'couldn', "'", 't', 'have', 'done', 'any', 'better', 'because', 'if', 'you', 'could', 'have', ',', 'you', 'would', 'have', '.', '[SEP]']"""
    
    result_as_list = ast.literal_eval(result)
    

    Now we have this

    ['[CLS]', 'You', 'couldn', "'", 't', 'have', 'done', 'any', 'better', 'because', 'if', 'you', 'could', 'have', ',', 'you', 'would', 'have', '.', '[SEP]']
    

    Let's go over your steps again. First, "remove the quote marks". But there are no stray quote marks to remove, because this is a list of strings; the extra quotes you see appear only because that is how Python represents a string.
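
    A quick illustration of that difference (the variable name is made up for the example):

    token = 't'
    print(token)        # t    <- the actual data, no quotes anywhere
    print(repr(token))  # 't'  <- the quotes only exist in the representation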

    Next, "remove the beginning and end markers". As this is a list, they're just the first and last elements, no further counting needed:

    result_as_list = result_as_list[1:-1]
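    # (alternative sketch: stripping the markers by value rather than position
    #  works too, and is a little more robust if they are ever missing)
    # result_as_list = [t for t in result_as_list if t not in ('[CLS]', '[SEP]')]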
    

    Next, "remove the commas". As in the first step, there are no (obsolete) comma's; they are part of how Python shows a list and are not there in the actual data.

    So we end up with

    ['You', 'couldn', "'", 't', 'have', 'done', 'any', 'better', 'because', 'if', 'you', 'could', 'have', ',', 'you', 'would', 'have', '.']
    

    which can be joined back into the original string using

    result_as_string = ' '.join(result_as_list)
    

    and the only problem remaining is that BERT apparently treats apostrophes, commas and full stops as separate 'words':

    You couldn ' t have done any better because if you could have , you would have .
    

    which need a bit o' replacing:

    result_as_string = result_as_string.replace(' ,', ',').replace(' .','.').replace(" ' ", "'")
    

    and you have your sentence back:

    You couldn't have done any better because if you could have, you would have.
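
    Putting the steps together for the dataframe, as a sketch (assuming, as in the question, that the tokenized column is called 't_words' and holds these list-representation strings):

    import ast

    def fix_df(row):
        tokens = ast.literal_eval(row['t_words'])   # string -> real list of tokens
        tokens = tokens[1:-1]                       # drop [CLS] and [SEP]
        sentence = ' '.join(tokens)
        return (sentence.replace(' ,', ',')
                        .replace(' .', '.')
                        .replace(" ' ", "'"))

    a_df['sents'] = a_df.apply(fix_df, axis=1)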
    

    The only problem I see is if there are leading or closing quotes that aren't part of a contraction. If you need to handle that, replace the space-quote-space replacement with a more focused one that specifically targets contractions like "couldn't", "can't", "aren't", etc., as sketched below.
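
    For instance, a minimal sketch with re (the sample string is made up for illustration) that only closes up an apostrophe sitting between word characters, so stand-alone quotes are left untouched:

    import re

    s = "You couldn ' t have done any better , he said ."
    s = re.sub(r"(\w) ' (\w)", r"\1'\2", s)   # couldn ' t  ->  couldn't
    # s is now: "You couldn't have done any better , he said ."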