Python: convert PDF to CSV (multi-line column)

My CSV is:

,Élément,État général,Observations
0,ENTRÉE,Etat d'usage,
1,PORTES,Etat d'usage,Chaînette cassé
Serrure du bas en mauvais état le système est
cassé au niveau de la chaînette
2,ENTRÉE / PORTESENTRÉE / PORTES,,
3,Type de porte,,Porte blindée
4,Poignée,,Bon état
5,Couleur,,Bois

But I want this:

,Élément,État général,Observations
0,ENTRÉE,Etat d'usage,
1,PORTES,Etat d'usage,Chaînette cassé; Serrure du bas en mauvais état le système ...
2,ENTRÉE / PORTESENTRÉE / PORTES,,
3,Type de porte,,Porte blindée
4,Poignée,,Bon état
5,Couleur,,Bois

My code converts one or more PDF files into one CSV per page, and looks like this:

import os
import time

import tabula

start_time = time.time()
path = './'

pdf_count = 0  # number of PDF files converted
csv_count = 0  # total number of CSV files written

for directory, subdirectories, files in os.walk(path):
    for f in files:
        if f.endswith('.pdf'):
            # read_pdf returns a list of DataFrames when pages='all'
            dfs = tabula.read_pdf(os.path.join(directory, f), pages='all')
            pdf_count += 1
            for i, curr_df in enumerate(dfs, start=1):
                csv_count += 1
                curr_df.to_csv('./' + str(directory) + '-' + str(i) + '.csv')

print("--- convert %d .PDF to %d .CSV in %s seconds ---" % (pdf_count, csv_count, time.time() - start_time))

Another constraint is that I can't handle the files case by case: I need to process every CSV in the same way.

Solution

You could open the CSV, read its lines, and append every line that does not start with a comma (the header) or a number (a data row) to the previous line. Then write the merged lines to a new CSV file:

with open('filename.csv', encoding='utf-8') as f:
    text = [line.rstrip() for line in f]  # remove trailing newline characters with rstrip()

lines = []
for line in text:
    if not line:
        continue  # skip empty lines
    if line[0] == ',' or line[0].isnumeric():
        lines.append(line)  # header (starts with ',') or numbered data row
    elif lines:
        lines[-1] = lines[-1] + "; " + line  # continuation: append to the previous row

with open('new_file.csv', mode='wt', encoding='utf-8') as newfile:
    newfile.write('\n'.join(lines))  # reinsert newline characters with '\n'.join()
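
To check the merging rule without touching any files, the same logic can be sketched as a function over a list of lines. The name `merge_continuation_lines` is made up for this illustration, and the sample rows are copied from the question:

```python
def merge_continuation_lines(lines):
    """Join lines that do not start a new row onto the previous row."""
    merged = []
    for line in lines:
        line = line.rstrip()
        if not line:
            continue  # skip empty lines
        if line[0] == ',' or line[0].isnumeric():
            merged.append(line)  # header (leading comma) or numbered data row
        elif merged:
            merged[-1] += '; ' + line  # continuation of the previous row
    return merged


sample = [
    ",Élément,État général,Observations",
    "0,ENTRÉE,Etat d'usage,",
    "1,PORTES,Etat d'usage,Chaînette cassé",
    "Serrure du bas en mauvais état le système est",
    "cassé au niveau de la chaînette",
    "2,ENTRÉE / PORTESENTRÉE / PORTES,,",
]

for row in merge_continuation_lines(sample):
    print(row)
```

Note that this rule relies only on the first character of each line, so a continuation line that happens to start with a digit or a comma would be treated as a new row.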

To process every CSV file in a directory, we can wrap this in a function and call it on each file:

import glob
import os

def process_csv(filepath):
    with open(filepath, encoding='utf-8') as f:
        text = [line.rstrip() for line in f]  # remove trailing newline characters with rstrip()

    lines = []
    for line in text:
        if not line:
            continue  # skip empty lines
        if line[0] == ',' or line[0].isnumeric():
            lines.append(line)  # header or numbered data row
        elif lines:
            lines[-1] = lines[-1] + "; " + line  # continuation: append to the previous row

    # 'name.csv' -> 'name_fixed.csv', written to the current directory
    fixed_name = os.path.splitext(os.path.basename(filepath))[0] + '_fixed.csv'
    with open(fixed_name, mode='wt', encoding='utf-8') as newfile:
        newfile.write('\n'.join(lines))  # reinsert newline characters with '\n'.join()
    print('fixed: ' + fixed_name)

files = glob.glob('./*.csv')  # use glob to list the CSV files in the current directory

for file in files:  # loop through the list and feed each file to process_csv
    if not file.endswith('_fixed.csv'):  # skip output from a previous run
        process_csv(file)
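
If the files on disk actually store the multi-line cell as a quoted field with an embedded line break (which is how Python's `csv` module writes such values), a line-based approach would split that field. A possible alternative, sketched here under that assumption, is to parse the text with the `csv` module and flatten the line breaks inside each field. The function name `flatten_multiline_fields` and the `separator` parameter are made up for this example:

```python
import csv
import io


def flatten_multiline_fields(csv_text, separator='; '):
    """Parse CSV text and replace line breaks inside fields with `separator`."""
    # newline='' keeps embedded line breaks untouched so csv.reader can see them
    reader = csv.reader(io.StringIO(csv_text, newline=''))
    out = io.StringIO(newline='')
    writer = csv.writer(out, lineterminator='\n')
    for row in reader:
        # splitlines() handles \n, \r and \r\n inside a field
        writer.writerow([separator.join(field.splitlines()) for field in row])
    return out.getvalue()


quoted = '1,PORTES,"Chaînette cassé\nSerrure du bas"\n2,Poignée,Bon état\n'
print(flatten_multiline_fields(quoted))
```

Because the parsing is done by the `csv` module, this variant does not depend on what character a continuation line starts with, but it only helps when the multi-line cell is quoted in the file.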