
Issues with row and column output when using csv.writer and a series of strings


I have a set of pdfs that I am attempting to extract data from for analysis. As part of this process I want to modify and export this data into a .csv file. Thus far I have been able to successfully extract my data with pdfplumber from my pdfs.

This portion of the data is a set of strings that look like:

 Deer W Pre 4-3F
 Deer W Post 2-1F
 DG Post 7F
 S Pre 2-12F
 Staff Post 3-1F
 Staff Pre 2-10F
 Staff Post 2-11F
 Tut Post 2-1F

I am trying to use csv.writer to write this series of strings into a .csv file so that all strings end up in the same column, with each string in its own row. I have done a lot of digging on here but have not been able to find a solution to my issue. The code I have used is:

        with open("output.csv", mode="a+") as fp:
            wr = csv.writer(fp, dialect="excel")
            for item in site_tree_info: #site_tree_info is the variable that stores the strings
               wr.writerow([str(item)]) 

This is giving me a rather strange output:

[Image: incorrect csv output]

Do folks have a suggestion about what to do to receive my anticipated output:

[Image: anticipated csv output]

I really don't understand why [str(item)] is not working for me here, as it has worked for lots of other folks with similar issues.

This is the code I used to create the strings listed above:

import os
import re

import pdfplumber

# Get list of output pdf files in our directory

meta_sample = re.compile(r'^[A-Z].*') #this is to pull text from page 1

for root, dirs, files in os.walk('/Users/myname/tree'):
    for filename in files:
        p = os.path.join(root, filename)
        #print(p)
        with pdfplumber.open(p) as pdf:
            #pull text from the first page of pdfs which includes information about the samples and the conditions they were analyzed under
            sample_info = pdf.pages[0]
            sample_info_text = sample_info.extract_text()   
            sample_info_text_split = sample_info_text.split('\n')

        
        for lines in sample_info_text_split: 
            if meta_sample.match(lines):
                column_name, *column_info = lines.split(':')
                column_info = ' '.join(column_info)
                #print(column_info) #we have accurately captured both left and right sides of the table from page 1
        
        #This prints Sample ID and sample site/tree info, which is the 2nd item [2] in the sample_info_text_split string
        #We then strip the string of the ":" and split the string into two at that point. I then grab the 2nd item in this split string [1] which prints the site and tree info
        site_tree_info = sample_info_text_split[2].strip().split(":", 1)[1]
        print(site_tree_info) #this prints as above

Solution

  • The simple explanation is that your site_tree_info variable is a str, so when you loop over it you get one character at a time, and wr.writerow creates a new row for every character. I suggest using a list instead of a string for site_tree_info, so that each loop iteration yields a whole string (I am assuming the data looks like this):

    site_tree_info  = ['Deer W Pre 4-3F','Deer W Post 2-1F']
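To make the difference concrete, here is a minimal sketch of both behaviours. It assumes the strings look like the samples in the question; the names all_site_tree_info and write_single_column are hypothetical, and an in-memory io.StringIO buffer stands in for output.csv so the result can be inspected directly.

```python
import csv
import io

# Iterating over a str yields one character at a time, which is why the
# loop in the question ends up writing one row per character.
site_tree_info = " DG Post 7F"
chars = [item for item in site_tree_info]

# Sketch of the fix: keep one whole string per PDF in a list
# (all_site_tree_info is a hypothetical name), then write each
# list element as a single-cell row.
all_site_tree_info = [" Deer W Pre 4-3F", " Deer W Post 2-1F", " DG Post 7F"]


def write_single_column(fp, values):
    """Write every string in `values` to its own row, in one column."""
    wr = csv.writer(fp, dialect="excel")
    for value in values:
        # strip() drops the leading space left over from split(":", 1)
        wr.writerow([value.strip()])


buffer = io.StringIO()  # stands in for the real output file
write_single_column(buffer, all_site_tree_info)
output = buffer.getvalue()
```

In the question's os.walk loop, this would probably mean appending each site_tree_info to a list and writing the list once at the end, or dropping the inner for loop and calling wr.writerow([site_tree_info]) directly. When writing to a real file, the csv module documentation recommends opening it with newline="" to avoid extra blank rows on some platforms.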