Python: convert pdf to csv (multi-line column)


My CSV is:

,Élément,État général,Observations
0,ENTRÉE,Etat d'usage,
1,PORTES,Etat d'usage,Chaînette cassé
Serrure du bas en mauvais état le système est
cassé au niveau de la chaînette
2,ENTRÉE / PORTESENTRÉE / PORTES,,
3,Type de porte,,Porte blindée
4,Poignée,,Bon état
5,Couleur,,Bois

But I want this:

,Élément,État général,Observations
0,ENTRÉE,Etat d'usage,
1,PORTES,Etat d'usage,Chaînette cassé; Serrure du bas en mauvais état le système ...
2,ENTRÉE / PORTESENTRÉE / PORTES,,
3,Type de porte,,Porte blindée
4,Poignée,,Bon état
5,Couleur,,Bois

My code converts one or more PDFs to CSV, writing one CSV file per page, and looks like this:

import os
import time

import tabula

start_time = time.time()
path = './'

pdf_count = 0  # number of PDFs converted
csv_count = 0  # total number of CSVs written

for directory, subdirectories, files in os.walk(path):
    for f in files:
        if f.endswith('.pdf'):
            # list of DataFrames extracted from this PDF
            dfs = tabula.read_pdf(os.path.join(directory, f), pages='all')
            pdf_count += 1
            for i, curr_df in enumerate(dfs, start=1):
                csv_count += 1
                curr_df.to_csv('./' + str(directory) + '-' + str(i) + '.csv')

print("--- convert %d .PDF to %d .CSV in %s seconds ---" % (pdf_count, csv_count, time.time() - start_time))

An added constraint is that I can't handle the files case by case: I need to be able to process all the CSVs in the same way.


Solution

  • You could open the CSV, read its lines, and append every line that does not start with a comma (the header) or a digit (a row index) to the previous line. Then write the merged lines to a new CSV file:

    with open('filename.csv', encoding='utf-8') as f:
        lines = []
        for line in f:
            line = line.rstrip()  # remove the newline character
            if not line:
                continue  # skip empty lines
            if line[0] == ',' or line[0].isnumeric():
                lines.append(line)  # header or numbered row: start a new record
            elif lines:
                lines[-1] += "; " + line  # continuation: merge into the previous record

    with open('new_file.csv', mode='wt', encoding='utf-8') as newfile:
        newfile.write('\n'.join(lines))  # reinsert newline characters with '\n'.join()
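    To check the merging rule without touching any files, here is a minimal sketch that applies it to the rows from the question. The function name `merge_continuation_lines` and the `sep` parameter are made up for this illustration.

```python
def merge_continuation_lines(rows, sep="; "):
    """Join lines that do not start a new record onto the previous record.

    A line starts a new record when its first character is a comma
    (the header) or a digit (the row index), as in the question's CSV.
    """
    merged = []
    for row in rows:
        row = row.rstrip()
        if not row:
            continue  # ignore blank lines
        if row[0] == "," or row[0].isdigit():
            merged.append(row)
        elif merged:
            merged[-1] += sep + row  # continuation of the previous record
    return merged


# Rows copied from the CSV shown in the question
sample = [
    ",Élément,État général,Observations",
    "0,ENTRÉE,Etat d'usage,",
    "1,PORTES,Etat d'usage,Chaînette cassé",
    "Serrure du bas en mauvais état le système est",
    "cassé au niveau de la chaînette",
    "2,ENTRÉE / PORTESENTRÉE / PORTES,,",
]

for line in merge_continuation_lines(sample):
    print(line)
```

    Note that this rule treats any line whose first character is a digit as a new row, so a continuation line that happens to begin with a number would not be merged.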
    

    To process every file in a directory, wrap this logic in a function and call it for each CSV file in that directory:

    import glob
    import os

    def process_csv(filepath):
        with open(filepath, encoding='utf-8') as f:
            lines = []
            for line in f:
                line = line.rstrip()  # remove the newline character
                if not line:
                    continue  # skip empty lines
                if line[0] == ',' or line[0].isnumeric():
                    lines.append(line)  # header or numbered row: start a new record
                elif lines:
                    lines[-1] += "; " + line  # continuation: merge into the previous record

        # e.g. 'page-1.csv' -> 'page-1_fixed.csv', written to the current directory
        fixed_name = os.path.splitext(os.path.basename(filepath))[0] + '_fixed.csv'
        with open(fixed_name, mode='wt', encoding='utf-8') as newfile:
            newfile.write('\n'.join(lines))  # reinsert newline characters with '\n'.join()
        print('fixed: ' + fixed_name)

    # use glob to list the CSV files in the current directory,
    # skipping files already produced by a previous run
    files = [p for p in glob.glob('./*.csv') if not p.endswith('_fixed.csv')]

    for file in files:  # loop through the list and feed each file to process_csv
        process_csv(file)
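
    One possible refinement, if a continuation line can itself begin with a digit: require an index followed directly by a comma before treating a line as a new row. This is only a sketch of a stricter test, not part of the approach above. The names `ROW_START` and `merge_rows` are hypothetical, and the second sample line is invented for illustration.

```python
import re

# A record starts with an optional integer index followed directly by a comma,
# e.g. ",Élément,..." (header) or "12,PORTES,..." (numbered row).
ROW_START = re.compile(r"^\d*,")


def merge_rows(rows, sep="; "):
    """Merge lines that do not match ROW_START into the previous record."""
    merged = []
    for row in rows:
        row = row.rstrip()
        if not row:
            continue  # ignore blank lines
        if ROW_START.match(row):
            merged.append(row)
        elif merged:
            merged[-1] += sep + row
    return merged


rows = [
    "1,PORTES,Etat d'usage,Chaînette cassé",
    "2 serrures en mauvais état",  # invented continuation line starting with a digit
    "3,Type de porte,,Porte blindée",
]

for line in merge_rows(rows):
    print(line)
```

    A continuation line that starts with digits immediately followed by a comma would still be mistaken for a new row, so this narrows the problem rather than removing it.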