While reading the PDF table using camelot some columns are concatenated and their values too like below
Date | Facture-ref\nfactureid| Description\items| Payé\nEscompte |Déboursé\nPaiement\net
--------------------------------------------------------------------
12/12/2019| 45333\n34343 | 7899\nscrewio | 89\n0.00 |3443.12\n231232.00\n456
12/12/2019| 453343\n3434 | 7845\nscrewio | 78\n0.00 |34.12\n232.00\455
my output should be like below
Date |facture | Facture | factureid | Description| items | Payé | Escompte| Déboursé | Paiement | net
-----------------------------------------------------
12/12/2017 | 45333 | 34343 | #al: 7899 | screwio | 89 | 0.00 | 3443.12| 231232.00 | 456|
12/12/2017 | 453343 | 3434 | #rfp: 7845 | screwio | 78 | 0.00 | 34.12 | 232.00 | 455 |
I want columns and values to be separated based on newline.
you experiment create using the below df
data = [['12/12/2019', '45333\n34343','7899\nscrewio','89\n0.00','3443.12\n231232.00\n456'], ['12/12/2019', '232\n3434','7845\nnuts','78\n0.00','34.12\n232.00\455'] ]
df33 = pd.DataFrame(data, columns = ['Date', 'Facture-ref\nfactureid','Description\nitems','Payé\nEscompte','Déboursé\nPaiement\nnet'])
I'm sure there's a more elegant way of doing this, but this should work.
edit: with and without dropped. Feel free to post a sample dataset if I'm misunderstanding you. sample data:
>>> data = {'col_1':['a']*5,
'col_2\ncol_3':['b\nc']*5,
'col_4':['d']*5}
>>> df = pd.DataFrame(data)
>>> df
col_1 col_2\ncol_3 col_4
0 a b\nc d
1 a b\nc d
2 a b\nc d
3 a b\nc d
4 a b\nc d
with drop:
>>> for col in [i for i in df.columns if '\n' in i]:
df[col.split('\n')]=df[col].str.split('\n',expand=True)
df.drop(columns=col,inplace=True)
>>> df
col_1 col_4 col_2 col_3
0 a d b c
1 a d b c
2 a d b c
3 a d b c
4 a d b c
without drop:
>>> for col in [i for i in df.columns if '\n' in i]:
df[col.split('\n')]=df[col].str.split('\n',expand=True)
>>> df
col_1 col_2\ncol_3 col_4 col_2 col_3
0 a b\nc d b c
1 a b\nc d b c
2 a b\nc d b c
3 a b\nc d b c
4 a b\nc d b c