Tags: python-3.x, pandas, data-cleaning

Issue with removing \n from pandas dataframe


I am trying to get rid of every \n in a whole pandas dataframe. I know there are already answers to this on Stack Overflow, but for some reason I can't get the desired output. I have the following dataframe:

  title     text    date    authors
0   [ECB completes foreign reserves investment in ...   [\nThe European Central Bank (ECB) completed an ...     [13 June 2017]  ECB
1   [Measures to improve the efficiency of the ope...   [\nThe Governing Council of the ECB has decided ...     [\n 23 January 2003 \n ]    ECB
2   []  []  []  ECB
3   [ECB publishes the results of the Euro Money M...   [Today the European Central Bank (ECB) is publ...   [\n 28 September 2012 \n ]  ECB
4   []  []  []  ECB

This is my desired output:

title   text    date    authors
0   [ECB completes foreign reserves investment in...    [The European Central Bank (ECB) completed an ...   [13 June 2017]  ECB
1   [Measures to improve the efficiency of the ope...   [The Governing Council of the ECB has decided ...   [23 January 2003]   ECB
2   []  []  []  ECB
3   [ECB publishes the results of the Euro Money M...   [Today the European Central Bank (ECB) is publ...   [28 September 2012]     ECB
4   []  []  []  ECB 

This is all the code I tried:

  1. Based on this Stack Overflow post, I tried:

    mydf=df.replace({r'\\n': ''}, regex=True)
    
    mydf=df['date'].str.strip(r'\\n') #this turns every obs into NaN 
    
    mydf=df.replace(to_replace=[r"\\n", "\n"], value=["",""], regex=True, inplace =True) #this gets rid of all data in dataframe for some reason
    

None of these worked.

  1. Based on this post, I tried (note that I am skipping answers I had already tried):

    mydf=df.replace(r'\s', '', regex = True, inplace = True) #this deleted all data

  2. based on this post I tried:

    mydf=df.replace('\\n',' ')

  3. based on comments for this post I tried:

    mydf=df['date'].replace(r'\s+|\\n', ' ', regex=True, inplace=True) and

    mydf=df.replace(r'\s+|\\n', ' ', regex=True, inplace=True)

  4. based on answers in this post I tried:

    mydf= df.replace({r'\s+$': '', r'^\s+': ''}, regex=True).replace(r'\n', ' ', regex=True)

    mydf=df.replace({ r'\A\s+|\s+\Z': '', '\n' : ' '}, regex=True, inplace=True) # this again deleted whole df

I don't understand why the answers found there are not working in my case, since they were accepted and most of those questions seem very similar to mine.


Solution

  • Try:

    df['date']=df['date'].str[0].str.replace(r"\n", "", regex=True)  # regex=True so r"\n" matches a newline on any pandas version
    

    This assumes that each cell in the date column is a list with exactly one element. It also flattens the cell, so you get the string from that single element instead of a list (an empty list becomes NaN). Chain .str.strip() if you also want to drop the spaces left around the date.

    If, however, date can contain more than one element and you want to merge them all into a single string after getting rid of every \n, try:

    df['date']=df['date'].str.join('').str.replace(r"\n", "", regex=True)
    

    Otherwise, if you want to keep the list format and only strip \n from each element, try the following (&& is a temporary separator, so it must not occur in the data):

    df['date']=df['date'].str.join(r'&&').str.replace(r"\n", "", regex=True).str.split(r'&&')
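
For the whole dataframe rather than just the date column, a plain Python pass over each list may be easier to reason about than the join/split trick. The sketch below is one possible way, not the only one: the sample data is abbreviated from the question, `clean_cell` and `list_columns` are made-up names, and it assumes every affected cell is a Python list of strings containing real newline characters.

```python
import pandas as pd

# Abbreviated sample modelled on the question: every column except
# `authors` holds a list of strings (possibly an empty list).
df = pd.DataFrame({
    "title": [["ECB completes foreign reserves investment"], []],
    "text": [["\nThe European Central Bank (ECB) completed an investment"], []],
    "date": [["\n 23 January 2003 \n "], []],
    "authors": ["ECB", "ECB"],
})


def clean_cell(cell):
    """Hypothetical helper: remove newlines and surrounding whitespace
    from each string inside a list cell; leave anything else untouched."""
    if isinstance(cell, list):
        return [
            item.replace("\n", "").strip() if isinstance(item, str) else item
            for item in cell
        ]
    return cell  # e.g. the plain-string `authors` column


# Assumption: these are the columns whose cells are lists.
list_columns = ["title", "text", "date"]

cleaned = df.copy()
for col in list_columns:
    cleaned[col] = cleaned[col].map(clean_cell)

print(cleaned["date"].tolist())  # [['23 January 2003'], []]
```

This keeps empty lists as [] and also strips the stray spaces around the dates. As for the attempts in the question: `replace(..., inplace=True)` returns None, so assigning its result to mydf is what made the data look deleted. And because the cells are lists rather than strings, `DataFrame.replace` with `regex=True` has no strings to match, while `.str.strip` returns NaN for non-string values.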