I have a CSV file that looks as follows:
This is a reduced example; my original CSV contains thousands of rows. My aim is to cut this CSV into several smaller pieces. However, I want to have the header in each file, and the header consists of the first 8 rows plus a 9th row that is blank and also belongs to the header.
As I am not interested in the data itself and just want to cut it, I am looking for a solution that is fast and does not modify the data. I do not want to use pandas, and I would like to avoid reading and writing with the csv module. I also tried the implementation mentioned in an answer to Splitting one csv into multiple files, but the problem there is that my header does not consist of just one row; it is 9 rows in total. And I would like to avoid reading and writing the data, as I actually just need to cut it.
I tried to use islice.
The first CSV can be created as follows:
from itertools import islice
with open(r"C:\Desktop\myfolder\mycsv.csv") as f, open (r"C:\Desktop\myfolder\out.csv", "w") as out:
r = islice(f, 0, 12)
out.writelines(r)
That gives me my first CSV, containing 12 rows (I think), including the header. Now I do not know how to cut my second file in such a way that it contains the data from the 13th row up to, let's say, the 20th row, but still includes the header. I am not sure whether this can be done with islice. I know how to get the data by setting start and stop, islice(f, 13, 20), but then the header is missing. I also thought about cutting one file with the header and one with the data and then attaching them, but I couldn't find a way to attach two CSVs without reading and writing or making use of pandas.
I would try something like this:
Code:
a.csv:
c1 c2
c3 c4
d1 d2
d3 d4
d5 d6
d7 d8
script.py:
with open("a.csv", "r") as inp:
header = []
headersize = 3
for _ in range(0, headersize):
header.append(inp.readline())
filenumber = 1
chunksize = 2
dataline = inp.readline()
while dataline != "":
with open("a_" + str(filenumber) + ".csv", "w") as outp:
for headerline in header:
outp.write(headerline)
for _ in range(0, chunksize):
outp.write(dataline)
dataline = inp.readline()
filenumber += 1
a_1.csv:
c1 c2
c3 c4
d1 d2
d3 d4
a_2.csv:
c1 c2
c3 c4
d5 d6
d7 d8
Instead of using a fixed chunksize, you can implement any other condition for starting a new file.
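If you want to stay with islice from the question, the same idea works there as well: a file object is an iterator, so consecutive islice calls continue where the previous one stopped. A rough sketch along those lines (the path, headersize and chunksize are only placeholders, assuming the 9-row header from the question):

from itertools import islice

headersize = 9      # 8 header rows plus the blank 9th row
chunksize = 1000    # data rows per output file (placeholder)

with open(r"C:\Desktop\myfolder\mycsv.csv") as f:
    header = list(islice(f, headersize))    # consume and cache the header once
    filenumber = 1
    while True:
        chunk = list(islice(f, chunksize))  # next block of data lines
        if not chunk:                       # no data left, stop
            break
        outname = r"C:\Desktop\myfolder\out_" + str(filenumber) + ".csv"
        with open(outname, "w") as out:
            out.writelines(header)
            out.writelines(chunk)
        filenumber += 1

This is the same line-by-line approach as the script above, just expressed with islice instead of readline.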