Search code examples
pythoncsvpython-itertools

Split csv into pieces with header and attach csv files


I have a CSV file that looks as follows:

ex1

This is an reduced example. My original CSV contains thousands of rows. Now I my aim is to cut this CSV into different smaller pieces. However, I want to have the header in each file and the header consists of the first 8 rows, the 9th contains a blank line and this also belongs to the header.

As I am not interested in the data and just cutting it, I tried to find a solution which is fast and does not involve any data modification. I do not want to use Pandas. I would like to avoid reading and writing with csv module, but I also tried to use the implementation mentioned in an answer to Splitting one csv into multiple files. Problem here is that my header consists not of just one row, but 9 rows in total. And I would like to avoid reading and writing data, as I actually just need to cut it.

I tried to use islice.

The first CSV can be created as follows:

from itertools import islice

with open(r"C:\Desktop\myfolder\mycsv.csv") as f, open (r"C:\Desktop\myfolder\out.csv", "w") as out:
    r = islice(f, 0, 12)
    out.writelines(r)

That gives my first CSV, containing 12 rows (I think), including the header. Now I do not know how to cut my second file in such way that it contains data from the 13th row, until lets say 20th row, but including the header. I am not sure if this can be done with islice. I know how cut to get the data with setting start and stop islice(f, 13 20), but then the header is missing. I also thought about cutting one file, the header and then the data file and attaching them. But I couldn't find a way to attach two CSVs, without having to read and write or make use of pandas.


Solution

  • I would try something like this:

    • read the file line by line
    • remember the header lines
    • start a new output file whenever needed, and prepend the header lines every time

    Code:

    a.csv:

    c1 c2 
    c3 c4
    
    d1 d2
    d3 d4
    d5 d6
    d7 d8
    

    script.py:

    with open("a.csv", "r") as inp:
        header = []
        headersize = 3
        for _ in range(0, headersize):
            header.append(inp.readline())
    
        filenumber = 1
        chunksize = 2
        dataline = inp.readline()
        while dataline != "":
            with open("a_" + str(filenumber) + ".csv", "w") as outp:
                for headerline in header:
                    outp.write(headerline)
                for _ in range(0, chunksize):
                    outp.write(dataline)
                    dataline = inp.readline()
            filenumber += 1
    
    

    a_1.csv:

    c1 c2 
    c3 c4
    
    d1 d2
    d3 d4
    

    a_2.csv:

    c1 c2 
    c3 c4
    
    d5 d6
    d7 d8
    

    Instead of using a fixed chunksize, you can implement any other condition for starting a new file.