python pandas nlp bleu

Split several sentences in pandas dataframe


I have a pandas dataframe with a column that looks like this.

sentences
['This is text.', 'This is another text.', 'This is also text.', 'Even more text.']
['This is the same in another row.', 'Another row another text.', 'Text in second row.', 'Last text in second row.']

Every row contains 10 sentences, each quoted with ' ' or " " and separated by commas. The column type is "str", and I was not able to convert it to a list of strings.

I want to transform the values of this dataframe so that they look like this:

[['This', 'is', 'text'], ['This', 'is', 'another', 'text'], ['This', 'is', 'also', 'text'], ['Even', 'more', 'text']]

I tried something like this:

    new_splits = []
    for num in range(len(refs)):
      komma = refs[num].replace(" ", "\', \'")#regex=True)
      new_splits.append(komma)

and this:

    new_splits = []
    for num in range(len(refs)):
      splitted = refs[num].split("', '")
      new_splits.append(splitted)

Disclaimer: I need this to compute BLEU scores and haven't found a way to do it for this kind of dataset. Thanks in advance!


Solution

  • You can use np.char.split in one line:

    import numpy as np
    df['separated'] = np.char.split(df['sentences'].tolist()).tolist()
    

    @Kata, if the sentences column type is str, meaning that each row holds a string instead of a list, e.g. "['This is text.', 'This is another text.', 'This is also text.', 'Even more text.']", then you need to convert the strings into lists first. One way is to use ast.literal_eval.

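    from ast import literal_eval
    import numpy as np

    # Parse each string into a real Python list, then split every sentence on whitespace.
    df['sentences'] = df['sentences'].apply(literal_eval)
    df['separated'] = np.char.split(df['sentences'].tolist()).tolist()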
    

    NOTE on data: Storing a string that represents a list is not a recommended way of keeping data. If possible, fix the source the data comes from. Preferably each cell should hold a plain string; failing that, an actual list, but not a string representing a list.
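
For completeness, here is a minimal end-to-end sketch that does not go through np.char.split. Two things motivate it: np.char.split splits on whitespace only, so tokens keep their trailing period ('text.' rather than 'text' as in the desired output), and it relies on every row having the same number of sentences. The toy dataframe, the helper name tokenize, and the choice of the regex \w+ are assumptions made for illustration, not part of the original answer. Note that \w+ also breaks contractions apart ("don't" becomes 'don' and 't'), so adjust the pattern if your BLEU setup expects different tokens.

```python
import re
from ast import literal_eval

import pandas as pd

# Toy data mirroring the question: each cell is a *string* that looks like a list.
# The second row is deliberately shorter to show that ragged rows are fine here.
df = pd.DataFrame({
    "sentences": [
        "['This is text.', 'This is another text.', 'This is also text.', 'Even more text.']",
        "['This is the same in another row.', 'Another row another text.']",
    ]
})


def tokenize(sentence):
    """Return the word tokens of one sentence, dropping punctuation (hypothetical helper)."""
    return re.findall(r"\w+", sentence)


# Step 1: parse each string into a real list of sentences.
parsed = df["sentences"].apply(literal_eval)

# Step 2: split every sentence into words, one list of tokens per sentence.
df["separated"] = parsed.apply(lambda sents: [tokenize(s) for s in sents])

print(df["separated"].iloc[0])
# [['This', 'is', 'text'], ['This', 'is', 'another', 'text'], ['This', 'is', 'also', 'text'], ['Even', 'more', 'text']]
```

Each cell of df['separated'] is then a list of token lists, one per sentence.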