
Splitting a large file into chunks


I have a file with 7946479 records, and I want to read it line by line and insert the records into a database (sqlite). My first approach was to open the file, read it line by line, and insert into the database at the same time; since it deals with a huge amount of data, this takes a very long time. I wanted to replace this naive approach, and while searching the internet I came across python-csv-to-sqlite (https://stackoverflow.com/questions/5942402/python-csv-to-sqlite). There the data is in a csv file while my file is in dat format, but I liked the answer to that problem, so now I am trying to do it like in that solution.

The approach there is to first split the whole file into chunks, and then run one database transaction per chunk instead of writing each record one at a time.
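As far as I understand it, in sqlite3 terms that batching boils down to collecting rows and issuing one executemany plus one commit per batch. Here is a minimal sketch of the idea; the database name, the table records and its two columns are placeholders of mine, not taken from the linked answer:

import sqlite3


def insert_in_batches(rows, batch_size=10000):
    # 'rows' is any iterable of tuples matching the table's columns
    con = sqlite3.connect('data.db')  # placeholder database name
    con.execute('CREATE TABLE IF NOT EXISTS records (a REAL, b REAL)')
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            # one statement and one commit per 10000 rows,
            # instead of one commit per record
            con.executemany('INSERT INTO records VALUES (?, ?)', batch)
            con.commit()
            batch = []
    if batch:  # leftover rows that did not fill a full batch
        con.executemany('INSERT INTO records VALUES (?, ?)', batch)
        con.commit()
    con.close()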

So I started writing code to split my file into chunks. Here is my code:

file = r'files/jan.dat'
test_file = r'random_test.txt'


def chunks(file_obj, size=10000):
    counter = 0
    file_chunks = []
    temp_chunks = []

    for line in file_obj:
        if line == '\n':
            continue
        if counter != size:
            temp_chunks.append(line)
            counter += 1
        else:
            file_chunks.append(temp_chunks)
            temp_chunks = []
            counter = 0
    file_obj.close()
    if len(temp_chunks) != 0:
        file_chunks.append(temp_chunks)

    yield file_chunks

if __name__ == '__main__':
    split_files = chunks(open(test_file))
    for chunk in split_files:
        print(len(chunk))

The output is 795, but what I wanted was to split the whole file into chunks of 10000 lines each.

I can't figure out what is going wrong here. I can't share my whole file, so for testing you can use this code to generate a file with 7946479 lines:

TEXT = 'Hello world'
FILE_LENGTH = 7946479

with open(r'random_test.txt', 'w') as f:
    for _ in range(FILE_LENGTH):
        f.write(f"{TEXT}\n")

This is what my original file looks like (the file format is dat):

lat lon day mon t2m rh2m    sf  ws
5   60  1   1   299.215 94.737  209.706 5.213
5   60.25   1   1   299.25  94.728  208.868 5.137
5   60.5    1   1   299.295 94.695  207.53  5.032
5   60.75   1   1   299.353 94.623  206.18  4.945
5   61  1   1   299.417 94.522  204.907 4.833
5   61.25   1   1   299.447 94.503  204.219 4.757
5   61.5    1   1   299.448 94.525  203.933 4.68
5   61.75   1   1   299.443 94.569  204.487 4.584
5   62  1   1   299.44  94.617  204.067 4.464

Solution

  • The reason your code prints 795 is that yield file_chunks sits after the for loop, so the generator yields exactly once: the whole list of all 795 chunks. The fix for that version would be to yield each temp_chunks inside the loop instead of collecting them all. An easy way to chunk the file, though, is to use f.read(size) until there is no content left. Note, however, that this method works with a number of characters instead of lines, so a chunk may end in the middle of a line.

    test_file = 'random_test.txt'
    
    
    def chunks(file_name, size=10000):
        with open(file_name) as f:
            # read at most 'size' characters per iteration;
            # read() returns '' at end of file, which stops the loop
            while content := f.read(size):
                yield content
    
    
    if __name__ == '__main__':
        split_files = chunks(test_file)
        for chunk in split_files:
            print(len(chunk))
    
    

    For the last chunk, it will take whatever is left. With the test file above that is 7748 characters: 7946479 lines of 'Hello world\n' make 95,357,748 characters, which is not an exact multiple of 10000.


    The same function, working with lines:

    test_file = "random_test.txt"
    
    
    def chunks(file_name, size=10000):
        with open(file_name) as f:
            # grab one line, then append up to size - 1 more;
            # readline() returns '' at end of file, so the last
            # chunk simply comes out shorter
            while content := f.readline():
                for _ in range(size - 1):
                    content += f.readline()

                yield content.splitlines()
    
    
    if __name__ == '__main__':
        split_files = chunks(test_file)
    
        for chunk in split_files:
            print(len(chunk))
    
    
    

    For the last chunk, it will take whatever is left; here, 6479 lines (7946479 mod 10000).
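
    To tie this back to the original task, the line-based generator above can feed sqlite3's executemany directly, making each 10000-line chunk a single transaction. This is only a sketch under a few assumptions of mine: the database and table names are placeholders, the columns mirror the header shown in the question, and the header line is skipped while parsing.

    import sqlite3


    def chunks(file_name, size=10000):
        # same line-based generator as above
        with open(file_name) as f:
            while content := f.readline():
                for _ in range(size - 1):
                    content += f.readline()

                yield content.splitlines()


    con = sqlite3.connect('weather.db')  # placeholder database name
    con.execute('CREATE TABLE IF NOT EXISTS weather '
                '(lat REAL, lon REAL, day INTEGER, mon INTEGER, '
                't2m REAL, rh2m REAL, sf REAL, ws REAL)')

    for chunk in chunks('files/jan.dat'):
        # split each whitespace-separated row; drop blanks and the header line
        rows = [line.split() for line in chunk
                if line.strip() and not line.startswith('lat')]
        con.executemany('INSERT INTO weather VALUES (?,?,?,?,?,?,?,?)', rows)
        con.commit()  # one transaction per chunk instead of one per record

    con.close()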