Search code examples
bashawkcommand-linenumber-formattingstring-parsing

Check to see if numbers in a column are sequential via command line


In a text file, I have a sequence of numbers in a column preceded by a short string. It is the 5th column in the example file here under "NAME":

SESSION NAME:   session
SAMPLE RATE:    48000.000000
BIT DEPTH:  16-bit
SESSION START TIMECODE: 00:00:00:00.00
TIMECODE FORMAT:    24 Frame
# OF AUDIO TRACKS:  2
# OF AUDIO CLIPS:   2
# OF AUDIO FILES:   2


M A R K E R S  L I S T I N G
#       LOCATION        TIME REFERENCE      UNITS       NAME                                COMMENTS
2       0:00.500        24000               Samples     xxxx0001                            
3       0:03.541        170000              Samples     xxxx0002                            
4       0:05.863        281458              Samples     xxxx0003                            
5       0:08.925        428430              Samples     xxxx0004                            
6       0:10.604        509025              Samples     xxxx0005                            
7       0:13.973        670742              Samples     xxxx0006                            
8       0:15.592        748453              Samples     xxxx0008                            
9       0:19.243        923666              Samples     xxxx0008


In the example above, 0007 is missing, and 0008 is duplicated.

Therefore, I would like to be able to check if the numbers are:

  1. sequential given the range that presently exists in the column.
  2. if there are any duplicates

I would also like to output these results:

SKIPPED:
xxxx0007

DUPLICATES:
xxxx0008

The furthest I have been able to get is to use awk to get the column I need:

cat <file.txt> | awk '{ print $5 }'

which gets me to this:

NAME
xxxx0001
xxxx0002
xxxx0003
xxxx0004
xxxx0005
xxxx0006
xxxx0008
xxxx0008

But I do not know where to go from here.

Do I need to loop through the list items and parse so I get the number only, then start doing some comparisons to the next line?

Any help would be tremendously appreciated Thank you!


Solution

  • As a starting point, please try the following:

    awk '
    NR>1 { gsub("[^0-9]", "", $5); count[$5]++ }
    END {
        print "Skipped:"
        for (i=1; i<NR; i++)
            if (count[i] == 0) printf "xxxx%04d\n", i
        print "Duplicates:"
        for (i=1; i<NR; i++)
            if (count[i] > 1) printf "xxxx%04d\n", i
    } ' file.txt
    

    Output:

    Skipped:
    xxxx0007
    Duplicates:
    xxxx0008
    
    • The condition NR>1 is used to skip the top header line.
    • gsub("[^0-9]", "", $5) removes non-number characters from $5. As a result, $5 is set to a number extracted from the 5th column.
    • The array count[] counts the occurances of each number. If the value is 0 (or undefined), it means the number is skipped. If the value is larger than 1, the number is duplicated.
    • The END { ... } block is executed after all the input lines are processed and it is useful to report the final results.

    However, the "Skipped/Duplicates" approach cannot well detect such cases as:

    #       LOCATION        TIME REFERENCE      UNITS       NAME            COMMENTS
    1       0:00.500        24000               Samples     xxxx0001
    2       0:02.888        138652              Samples     xxxx0003
    3       0:04.759        228446              Samples     xxxx0004
    4       0:07.050        338446              Samples     xxxx0005
    5       0:09.034        433672              Samples     xxxx0006
    6       0:12.061        578958              Samples     xxxx0007
    7       0:14.111        677333              Samples     xxxx0008
    8       0:17.253        828181              Samples     xxxx0009
    

    or

    #       LOCATION        TIME REFERENCE      UNITS       NAME            COMMENTS
    1       0:00.500        24000               Samples     xxxx0001
    2       0:02.888        138652              Samples     xxxx0003
    3       0:04.759        228446              Samples     xxxx0002
    4       0:07.050        338446              Samples     xxxx0004
    5       0:09.034        433672              Samples     xxxx0005
    6       0:12.061        578958              Samples     xxxx0006
    7       0:14.111        677333              Samples     xxxx0007
    8       0:17.253        828181              Samples     xxxx0008
    

    It will be better to perform a line-by-line comparison between expected value and the actual value. Then how about:

    awk '
    NR>1 {
        gsub("[^0-9]", "", $5)
        if ($5 != NR-1) printf "Line: %d  Expected: xxxx%04d  Actual: xxxx%04d\n", NR, NR-1, $5
    } ' file.txt
    

    output for the original example:

    Line: 8  Expected: xxxx0007  Actual: xxxx0008
    

    [EDIT]

    According to the revised input file which includes more extra header lines, how about:

    awk '
    f {
        gsub("[^0-9]", "", $5)
        if ($5 != NR-skip) printf "Line: %d  Expected: xxxx%04d  Actual: xxxx%04d\n", NR, NR-skip, $5
    }
    /^#[[:blank:]]+LOCATION[[:blank:]]+TIME REFERENCE/ {
        skip = NR
        f = 1
    }
    ' file.txt
    

    Output:

    Line: 19  Expected: xxxx0007  Actual: xxxx0008
    

    The script above skips the lines until the specific pattern # LOCATION TIME REFERENCE is found.

    • The f { ... } block is executed if f is true. So the block is skipped until f is set to a nonzero value.
    • The /^# .../ { ... } block is executed if the input line matches the pattern. If found, skip is set to the number of header lines and f (flag) is set to 1 so the upper block is executed from the next iteration.

    Hope this helps.