Search code examples
pythoncsvformattingcsvreader

Importing and formatting a CSV in Python for plotting


I am fairly new to Python but understand what it is I need help with.

I am trying to import a csv file which looks like this:

Title of Project
name1, Info 1
name2, Info 2
name3, Info 3
name4, Info 4
name5, Info 5
a,b,c,d,e,f,g,h,i,j,k,l,m
a1,b1,c1,d1,e1,f1,g1,h1,i1,j1,k1,l1,m1
...
...
an,bn,cn,dn,en,fn,gn,hn,in,jn,jn,kn,ln,mn

All of the commas are actually delimited by '\t' and I don't know how to deal with that?

So overall, I want to remove the title and the next 5 rows and only work with the data from row 7 to the nth row in the file for plotting purposes.

The row that holds 'a,b,c,d,etc' need to be headers and the data under it should be separated accordingly into the columns.

I need help with syntax and understanding the most efficient way to do this. Any help or guidance would be much appreciated!

Thanks.


Solution

  • You can use Python's built-in CSV module to handle this.

    My example reads your sample data with a tab-delimiter, discards the first 6 rows, and saves the rest of the rows to write out to a new comma-delimited file.

    You first need to open the file for reading and create a CSV reader object:

    import csv
    
    with open('input.csv', newline='') as f:
        reader = csv.reader(f, delimiter='\t')
        ...
    

    You can iterate over the rows of the reader a couple of different ways.

    In your case, to skip a number of rows, call next(reader) a number of times in sequence, ignoring the data it returns:

        ...
        for i in range(1, 7):
            next(reader)  # manually advance reader / discard rows
        ...
    

    You can also iterate the rows of the reader with a standard for-loop. I'm creating the all_rows list to hold the rows you really care about:

        ...
        all_rows = []
        for row in reader:
            all_rows.append(row)
        ...
    

    For the standard reader object, there's nothing special about headers, it's totally up to you if it exists or has any special value... to the reader, it's just a row like any other:

    print(all_rows)
    
    [
     ['a' , 'b' , 'c' , 'd' , 'e' , 'f' , 'g' , 'h ', 'i' , 'j' , 'k' , 'l' , 'm'],
     ['a1', 'b1', 'c1', 'd1', 'e1', 'f1', 'g1', 'h1', 'i1', 'j1', 'k1', 'l1', 'm1'],
     ['an', 'bn', 'cn', 'dn', 'en', 'fn', 'gn', 'hn', 'in', 'jn', 'kn', 'ln', 'mn']
    ]
    

    Now, you can do whatever you need with all_rows, feed it to a plotter, or save it as a cleaned-up CSV:

    with open('output.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerows(all_rows)
    

    Here's the official docs for the CSV module. It's a good reference to remind of you how it works, once you have the hang of it, but I don't think it's really very good to get a basic understanding of what's going on.

    I've written my own how-to CSV w/Python guide. I take a little more time to go over more basic things, but it's the simple things I see people here on SO getting tripped up on.