Tags: mysql, c, bash, csv, load-data-infile

Fastest way to reformat millions of lines to CSV


I have a text file with millions of lines that needs to be imported into a MySQL table as quickly as possible. From what I understand, LOAD DATA is best suited for this.
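
For reference, the import I have in mind would look something like the following once the file is reformatted (the table name mytable, its columns, and the file path are placeholders):

    LOAD DATA LOCAL INFILE '/path/to/reformatted.tsv'
    INTO TABLE mytable
    FIELDS TERMINATED BY '\t'
    LINES TERMINATED BY '\n'
    (col_a, col_b);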

The data is formatted as follows, where each uppercase letter in parentheses stands for a string:

(A)(1-3 tabs)(B)
(3 tabs)(C)
(3 tabs)(D)
(3 tabs)(E)

(F)(1-3 tabs)(G)
(3 tabs)(H)
...

The data thus needs to be reformatted so that the first string of each section is repeated on all subsequent lines until the next section begins:

(A)(tab)(B)
(A)(tab)(C)
(A)(tab)(D)
(A)(tab)(E)
(F)(tab)(G)
(F)(tab)(H)
...

I'm considering writing a C program, but can Bash do it just as fast (and as simply)? Is this perhaps a classic problem with a very effective and compact solution?


Solution

  • This is a job for a Perl script; here is one for you. It's lightly tested; it takes a list of filenames to operate on as command-line arguments and/or reads from stdin, and writes to stdout. It assumes that the actual number of tabs does not matter, and that there are only ever one or two nonempty fields on a line. (Blank separator lines are skipped; it will complain about and skip any other line that isn't in the expected format.) A usage sketch follows below the script.

    #! /usr/bin/perl
    
    our $left;                  # first field of the current section
    while (<>) {
        chomp;
        next if /^$/;           # skip the blank separator lines between sections
        if (/^([^\t]+)\t+([^\t]+)$/) {
            # Leader line: remember its first field and print both fields.
            $left = $1;
            printf("%s\t%s\n", $left, $2);
        } elsif (/^\t+([^\t]+)$/) {
            # Continuation line: reuse the most recent leader's first field.
            if (defined $left) {
                printf("%s\t%s\n", $left, $1);
            } else {
                warn "$ARGV:$.: continuation line before leader line\n";
            }
        } else {
            warn "$ARGV:$.: line in unrecognized format\n";
        }
    } continue {
        close ARGV if eof; # reset line numbering for each input file
    }
    

    You might be able to write a C program that beats this for speed, but it's going to be more work than it's worth. A shell script (bash-specific or otherwise) will be orders of magnitude slower.
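
    For completeness, a usage sketch (reformat.pl and the file names are placeholders; any mix of file arguments and stdin works):

        perl reformat.pl input.txt > reformatted.tsv
        cat part1.txt part2.txt | perl reformat.pl > reformatted.tsv

    The resulting reformatted.tsv can then be fed to the LOAD DATA statement shown in the question.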