
Processing huge UTF-8 files by splitting them into multiple files


I am developing an importer program in C# for large UTF-8 text files (characters have varying byte lengths). Loading an entire 20 GB file into RAM is neither suitable nor possible, so it is better to split the file into multiple smaller files and process those. My problem is the splitting itself. My current solution reads the file line by line and starts a new output file whenever the line count reaches my chosen limit. However, reading line by line just to split is not fast, and the splitting time is high. Is there a faster algorithm for splitting large UTF-8 files into multiple files without reading them line by line?


Solution

  • My suggestions for your problem are below. I have kept separation of concerns in mind: splitting the file and processing it can be handled separately, which makes maintenance easier.

    1. Read the file in binary mode rather than as text.
    2. Do not read line by line; splitting does not require parsing the lines.
    3. Use seek to jump directly to the approximate split position (see the linked reference).
    4. If the split files must contain complete lines, then after seeking to the position, search for the next end-of-line character and split the file there.
    5. Once the files are split, process them individually.
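
The steps above could be sketched roughly as follows. This is only a minimal illustration, not a tuned implementation: the class name `Utf8FileSplitter`, the method names, the `part_00000.txt` naming scheme, and the buffer size are all assumptions made for the example. It relies on the fact that in UTF-8 the byte `0x0A` (`\n`) never occurs inside a multi-byte character, so scanning raw bytes for the next line feed cannot cut a character in half.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper class; all names here are illustrative only.
public static class Utf8FileSplitter
{
    // Splits inputPath into parts of roughly targetPartBytes each, always
    // cutting right after a '\n' byte so every part holds complete lines.
    public static List<string> Split(string inputPath, string outputDir, long targetPartBytes)
    {
        if (targetPartBytes <= 0)
            throw new ArgumentOutOfRangeException(nameof(targetPartBytes));

        Directory.CreateDirectory(outputDir);
        var parts = new List<string>();
        var buffer = new byte[81920]; // assumed copy-buffer size

        using (var input = new FileStream(inputPath, FileMode.Open, FileAccess.Read))
        {
            long start = 0;
            while (start < input.Length)
            {
                // Jump ahead by the target size, then move on to the next line end.
                long end = FindCut(input, Math.Min(start + targetPartBytes, input.Length));
                string partPath = Path.Combine(outputDir, $"part_{parts.Count:D5}.txt");
                CopyRange(input, start, end, partPath, buffer);
                parts.Add(partPath);
                start = end;
            }
        }
        return parts;
    }

    // Seeks to 'position' and returns the offset just past the next '\n',
    // or the file length if no further line feed exists.
    private static long FindCut(FileStream input, long position)
    {
        if (position >= input.Length) return input.Length;
        input.Seek(position, SeekOrigin.Begin);
        int b;
        while ((b = input.ReadByte()) != -1)
        {
            if (b == '\n') return input.Position;
        }
        return input.Length;
    }

    // Copies the raw bytes in [start, end) to a new file without decoding them.
    private static void CopyRange(FileStream input, long start, long end, string partPath, byte[] buffer)
    {
        input.Seek(start, SeekOrigin.Begin);
        using (var output = new FileStream(partPath, FileMode.Create, FileAccess.Write))
        {
            long remaining = end - start;
            while (remaining > 0)
            {
                int read = input.Read(buffer, 0, (int)Math.Min(buffer.Length, remaining));
                if (read == 0) break; // unexpected end of file
                output.Write(buffer, 0, read);
                remaining -= read;
            }
        }
    }
}
```

A call such as `Utf8FileSplitter.Split(@"C:\data\big.txt", @"C:\data\parts", 512L * 1024 * 1024)` (paths and size are placeholders) would produce parts of about 512 MB each. The parts vary slightly in size because each one is extended to the end of its last line, and a byte order mark, if the original has one, ends up only in the first part.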