Tags: python, pandas, newline, python-re

Remove newline characters from pandas series of lists


I have a pandas DataFrame with two columns: tags, which contains numbers, and elements, which contains lists of strings.

Dataframe:

import pandas as pd

df = pd.DataFrame({
    'tags': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
    'elements': {
        0: ['\n☒', '\nANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 '],
        1: ['', ''],
        2: ['\n', '\nFor the Fiscal Year Ended June 30, 2020'],
        3: ['\n', '\n'],
        4: ['\n', '\nOR']
    }
})

I am trying to remove every \n from each string in the lists in the elements column, but I'm really struggling to do so. My attempt was a nested loop with re.sub() to replace them, but it did nothing (granted, this is a horrible solution). This was my attempt:


for ls in range(len(page_table.elements)):
    for st in range(len(page_table.elements[i])):
        page_table.elements[i][st] = re.sub('\n', '', page_table.elements[i][st])

Is there a way to do this?


Solution

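  • You can explode the lists so that each string gets its own row, replace the \n characters, and then group by the original index to rebuild the lists.
    You can leave out the .groupby(level=0).agg(list) if you don't need the strings put back into lists, though the result will then have one row per string and so a different shape from the original DataFrame.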

    df["elements"] = (
        df["elements"]
        .explode()
        .str.replace(r"\n", "", regex=True)
        .groupby(level=0)
        .agg(list)
    )
    

    Which outputs:

    0    [☒, ANNUAL REPORT PURSUANT TO SECTION 13 OR 15...
    1                                                 [, ]
    2          [, For the Fiscal Year Ended June 30, 2020]
    3                                                 [, ]
    4                                               [, OR]
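
As an alternative to the explode approach above, here is a minimal sketch that cleans each list in place with apply and a list comprehension. The helper name strip_newlines is hypothetical, and the sample data is a shortened version of the DataFrame from the question. It also sidesteps a likely cause of the original loop doing nothing: that loop iterates with ls but indexes with i.

```python
import pandas as pd

# Shortened sample data modeled on the question's DataFrame.
df = pd.DataFrame({
    "tags": [1, 1, 1],
    "elements": [
        ["\n☒", "\nANNUAL REPORT "],
        ["", ""],
        ["\n", "\nOR"],
    ],
})


def strip_newlines(items):
    """Return a new list with every newline removed from each string.

    Hypothetical helper for illustration; assumes every item is a str.
    """
    return [s.replace("\n", "") for s in items]


# apply calls the helper once per row, so the Series keeps its shape and index.
df["elements"] = df["elements"].apply(strip_newlines)

print(df["elements"].tolist())
```

One difference that may matter: explode turns an empty list into NaN, so a row holding [] would come back as [nan] after the groupby, whereas the list comprehension leaves it as []. No regular expression is needed here because a plain str.replace handles a literal newline.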