Search code examples
unixawksplitgawk

Split file without separating rows beginning with like values in Unix


I have a sorted .csv file that is something like this:

AABB1122,ABC,BLAH,4
AABB1122,ACD,WHATEVER,1
AABB1122,AGT,CAT,4
CCDD4444,AYT,DOG,4
CCDD4444,ACG,MUMMY,8
CCEE4444,AOP,RUN,5
DDFF9900,TUI,SAT,33
DDFF9900,WWW,INDOOR,5

I want to split the file into smaller files of roughly two lines each, but I do not want rows with like values in the first column separated.

Here, I would end up with three files:

x00000:

AABB1122,ABC,BLAH,4
AABB1122,ACD,WHATEVER,1
AABB1122,AGT,CAT,4

x00001:

CCDD4444,AYT,DOG,4
CCDD4444,ACG,MUMMY,8

x00002:

CCEE4444,AOP,RUN,5
DDFF9900,TUI,SAT,33
DDFF9900,WWW,INDOOR,5 

My actual data is about 7 gigs in size and contains over 100 million lines. I want to split it into files of about 100K lines each or ~6MB. I am fine with using either file size or line numbers for splitting.

I know that I can use "sort" to split, such as:

split -a 5 -d -1 2

Here, that would give me four files, and like values in the first column would be split over files in most cases.

I think I probably need awk, but, even after reading through the manual, I am not sure how to proceed.

Help is appreciated! Thanks!


Solution

  • An awk script:

    BEGIN   { FS = ","  }
    !name   { name = sprintf("%06d-%s.txt", NR, $1) }
    
    count >= 2 && prev != $1  {
        close(name)
        name = sprintf("%06d-%s.txt", NR, $1)
        count = 0
    }
    
    {
        print >name
        prev = $1
        ++count
    }
    

    Running this on the given data will create three files:

    $ awk -f script.awk file.csv
    
    $ cat 000001-AABB1122.txt
    AABB1122,ABC,BLAH,4
    AABB1122,ACD,WHATEVER,1
    AABB1122,AGT,CAT,4
    
    $ cat 000004-CCDD4444.txt
    CCDD4444,AYT,DOG,4
    CCDD4444,ACG,MUMMY,8
    
    $ cat 000006-CCEE4444.txt
    CCEE4444,AOP,RUN,5
    DDFF9900,TUI,SAT,33
    DDFF9900,WWW,INDOOR,5
    

    I have arbitrarily chosen to use the line number from the original file from where the first line was taken, along with the first field's data on that line as the filename.

    The script counts the number of lines printed to the current output file, and if that number is greater than or equal to 2, and if the first field's value is different from the previous line's first field, the current output file is closed, a new output name is constructed, and the count is reset.

    The last block simply prints to the current filename, remembers the first field in the prev variable, and increments the count.

    The BEGIN block initializes the field delimiter (before the first line is read) and the !name block sets the initial output file name (when reading the very first line).


    To get exactly the filenames that you have in the question, use

    name = sprintf("x%05d", ++n)
    

    to set the output filename in both places where this is done.