Search code examples
pandasdataframenewlinepython-camelot

Data frame columns contains many newline (\n) and its value respectively .How to separate it as new columns and values too


While reading the PDF table using camelot some columns are concatenated and their values too like below

Date      | Facture-ref\nfactureid|     Description\items|  Payé\nEscompte  |Déboursé\nPaiement\net
--------------------------------------------------------------------
12/12/2019| 45333\n34343          |     7899\nscrewio    |  89\n0.00        |3443.12\n231232.00\n456
12/12/2019| 453343\n3434          |     7845\nscrewio    |  78\n0.00        |34.12\n232.00\455

my output should be like below

Date        |facture |      Facture |      factureid | Description| items  |   Payé |  Escompte|  Déboursé |   Paiement |    net
-----------------------------------------------------
 12/12/2017 |  45333 |       34343  |     #al: 7899  |    screwio |    89  |   0.00 |   3443.12|   231232.00 |   456|
 12/12/2017 | 453343 |         3434 |    #rfp: 7845  |    screwio |    78  |   0.00 |    34.12 |     232.00  |   455  |

I want columns and values to be separated based on newline.

you experiment create using the below df

data = [['12/12/2019', '45333\n34343','7899\nscrewio','89\n0.00','3443.12\n231232.00\n456'], ['12/12/2019', '232\n3434','7845\nnuts','78\n0.00','34.12\n232.00\455'] ]

df33 = pd.DataFrame(data, columns = ['Date', 'Facture-ref\nfactureid','Description\nitems','Payé\nEscompte','Déboursé\nPaiement\nnet'])


Solution

  • I'm sure there's a more elegant way of doing this, but this should work.

    edit: with and without dropped. Feel free to post a sample dataset if I'm misunderstanding you. sample data:

    >>> data = {'col_1':['a']*5,
        'col_2\ncol_3':['b\nc']*5,
        'col_4':['d']*5}
    >>> df = pd.DataFrame(data)
    >>> df
      col_1 col_2\ncol_3 col_4
    0     a         b\nc     d
    1     a         b\nc     d
    2     a         b\nc     d
    3     a         b\nc     d
    4     a         b\nc     d
    

    with drop:

    >>> for col in [i for i in df.columns if '\n' in i]:
        df[col.split('\n')]=df[col].str.split('\n',expand=True)
        df.drop(columns=col,inplace=True)
    >>> df
      col_1 col_4 col_2 col_3
    0     a     d     b     c
    1     a     d     b     c
    2     a     d     b     c
    3     a     d     b     c
    4     a     d     b     c
    

    without drop:

    >>> for col in [i for i in df.columns if '\n' in i]:
        df[col.split('\n')]=df[col].str.split('\n',expand=True)
    >>> df
      col_1 col_2\ncol_3 col_4 col_2 col_3
    0     a         b\nc     d     b     c
    1     a         b\nc     d     b     c
    2     a         b\nc     d     b     c
    3     a         b\nc     d     b     c
    4     a         b\nc     d     b     c