Search code examples
shellawkreadline

text file -- how to sort adjacent lines that have the same level of indentation


UPDATE: The root problem has been solved by fixing a number of Sequelize migrations that always run before mysqldump is called, as discussed in the comments that are below the article that is linked in the next paragraph. However, the core technical challenge is still interesting.

I have a problem with mysqldump that might be solved by configuring mysqldump differently, but probably will be solved by just piping the output through a shell script.

Basically, mysqldump always outputs the tables in the same order, but it list all columns (other than id) for each table in random order.

So, the first run might output this...

create TABLE `ONE` ( 
  `id` int NOT NULL AUTO_INCREMENT,
  `column_a` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `column_b` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `column_c` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;
create TABLE `TWO` ( 
  `id` int NOT NULL AUTO_INCREMENT,
  `column_x` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `column_y` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `column_z` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;

...and on the second run, it might produce something like this:

create TABLE `ONE` ( 
  `id` int NOT NULL AUTO_INCREMENT,
  `column_c` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `column_b` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `column_a` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;
create TABLE `TWO` ( 
  `id` int NOT NULL AUTO_INCREMENT,
  `column_y` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `column_x` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `column_z` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;

I'd like to pipe the result through a shellscript that always sorts the lines in the same way. What script would achieve this? The script needs to run on a build agent that runs on Ubuntu, so if it is possible and practical to use standard GNU tools like awk then that would be superior to using custom tools.


Solution

  • Assumptions:

    • the input is guaranteed to have a very specific format that follows exactly the sample provided in the question, namely ...
    • each clause (of the create table command) resides on a separate line (eg, we won't see 2 columns listed on a single line)
    • there is only one level of indents
    • all column clauses contain the string NOT NULL or NULL
    • we do not sort the column clause containing the string AUTO-INCREMENT
    • we do not sort the PRIMARY KEY clause
    • all sortable column clauses show up between the AUTO-INCREMENT column clause and the PRIMARY KEY clause
    • NOTE: we are not going to address any of the slew of other options available with the mysql 'create table' command, ie, we're not going to build a full-fledged parser

    One approach:

    • if a line matches the format of a 'sortable column clause' then we add it to an array
    • if a line does not match the format of a 'sortable column clause' then we a) dump anything currently in the array to stdout and then b) print the current line to stdout

    One GNU awk implementation of this approach:

    awk '
    BEGIN                { PROCINFO["sorted_in"]="@val_str_asc"     # sort arrays by value, sorted as string in asc[ending] order
                           delete lines                             # designate variable "lines" as an array
                         }
    
    /^[[:space:]]/    &&                                            # if line starts with white space (ie, it is indented) and ...
    !/AUTO_INCREMENT/ &&                                            # line does not contain string "AUTO_INCREMENT" and ...
    / NULL/              { lines[++cnt] = $0                        # line contains a single white space + string "NULL", then save current line in array
                           next                                     # skip to next line of input
                         }
    
    cnt                  { for (i in lines)                         # loop through indices of array in @val_str_asc sorted order
                               print lines[i]                       # print array value to stdout
                           delete lines                             # reset array
                           cnt = 0                                  # reset counter (aka array index)
                         }
    1                                                               # print current line of input to stdout
    ' sample.sql
    

    NOTES:

    • sample.sql is an exact copy of the 2nd pair of create table commands from OP's question (ie, the example with the out-of-order column_X rows)
    • requires GNU awk for PROCINFO["sort_in"] support which in turn is used to sort the contents of the array; see gawk predefined sorting orders for details

    This generates:

    create TABLE `ONE` (
      `id` int NOT NULL AUTO_INCREMENT
      `column_a` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
      `column_b` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
      `column_c` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
      PRIMARY KEY (`id`)
    ) ENGINE=InnoDB;
    create TABLE `TWO` (
      `id` int NOT NULL AUTO_INCREMENT,
      `column_x` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
      `column_y` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
      `column_z` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
      PRIMARY KEY (`id`)
    ) ENGINE=InnoDB;