bash awk command-line number-formatting string-parsing

Check to see if numbers in a column are sequential via command line

In a text file, I have a sequence of numbers in a column preceded by a short string. It is the 5th column in the example file here under "NAME":

SESSION NAME:   session
SAMPLE RATE:    48000.000000
BIT DEPTH:  16-bit
SESSION START TIMECODE: 00:00:00:00.00
TIMECODE FORMAT:    24 Frame
# OF AUDIO TRACKS:  2
# OF AUDIO CLIPS:   2
# OF AUDIO FILES:   2


M A R K E R S  L I S T I N G
#       LOCATION        TIME REFERENCE      UNITS       NAME                                COMMENTS
2       0:00.500        24000               Samples     xxxx0001                            
3       0:03.541        170000              Samples     xxxx0002                            
4       0:05.863        281458              Samples     xxxx0003                            
5       0:08.925        428430              Samples     xxxx0004                            
6       0:10.604        509025              Samples     xxxx0005                            
7       0:13.973        670742              Samples     xxxx0006                            
8       0:15.592        748453              Samples     xxxx0008                            
9       0:19.243        923666              Samples     xxxx0008

In the example above, 0007 is missing, and 0008 is duplicated.

Therefore, I would like to be able to check if the numbers are:

sequential given the range that presently exists in the column.

if there are any duplicates

I would also like to output these results:

SKIPPED:
xxxx0007

DUPLICATES:
xxxx0008

The furthest I have been able to get is to use awk to get the column I need:

cat <file.txt> | awk '{ print $5 }'

which gets me to this:

NAME
xxxx0001
xxxx0002
xxxx0003
xxxx0004
xxxx0005
xxxx0006
xxxx0008
xxxx0008

But I do not know where to go from here.

Do I need to loop through the list items and parse so I get the number only, then start doing some comparisons to the next line?

Any help would be tremendously appreciated Thank you!

Solution

As a starting point, please try the following:

awk '
NR>1 { gsub("[^0-9]", "", $5); count[$5]++ }
END {
    print "Skipped:"
    for (i=1; i<NR; i++)
        if (count[i] == 0) printf "xxxx%04d\n", i
    print "Duplicates:"
    for (i=1; i<NR; i++)
        if (count[i] > 1) printf "xxxx%04d\n", i
} ' file.txt

Output:

Skipped:
xxxx0007
Duplicates:
xxxx0008

The condition NR>1 is used to skip the top header line.
gsub("[^0-9]", "", $5) removes non-number characters from $5. As a result, $5 is set to a number extracted from the 5th column.
The array count[] counts the occurances of each number. If the value is 0 (or undefined), it means the number is skipped. If the value is larger than 1, the number is duplicated.
The END { ... } block is executed after all the input lines are processed and it is useful to report the final results.

However, the "Skipped/Duplicates" approach cannot well detect such cases as:

#       LOCATION        TIME REFERENCE      UNITS       NAME            COMMENTS
1       0:00.500        24000               Samples     xxxx0001
2       0:02.888        138652              Samples     xxxx0003
3       0:04.759        228446              Samples     xxxx0004
4       0:07.050        338446              Samples     xxxx0005
5       0:09.034        433672              Samples     xxxx0006
6       0:12.061        578958              Samples     xxxx0007
7       0:14.111        677333              Samples     xxxx0008
8       0:17.253        828181              Samples     xxxx0009

#       LOCATION        TIME REFERENCE      UNITS       NAME            COMMENTS
1       0:00.500        24000               Samples     xxxx0001
2       0:02.888        138652              Samples     xxxx0003
3       0:04.759        228446              Samples     xxxx0002
4       0:07.050        338446              Samples     xxxx0004
5       0:09.034        433672              Samples     xxxx0005
6       0:12.061        578958              Samples     xxxx0006
7       0:14.111        677333              Samples     xxxx0007
8       0:17.253        828181              Samples     xxxx0008

It will be better to perform a line-by-line comparison between expected value and the actual value. Then how about:

awk '
NR>1 {
    gsub("[^0-9]", "", $5)
    if ($5 != NR-1) printf "Line: %d  Expected: xxxx%04d  Actual: xxxx%04d\n", NR, NR-1, $5
} ' file.txt

output for the original example:

Line: 8  Expected: xxxx0007  Actual: xxxx0008

[EDIT]

According to the revised input file which includes more extra header lines, how about:

awk '
f {
    gsub("[^0-9]", "", $5)
    if ($5 != NR-skip) printf "Line: %d  Expected: xxxx%04d  Actual: xxxx%04d\n", NR, NR-skip, $5
}
/^#[[:blank:]]+LOCATION[[:blank:]]+TIME REFERENCE/ {
    skip = NR
    f = 1
}
' file.txt

Output:

Line: 19  Expected: xxxx0007  Actual: xxxx0008

The script above skips the lines until the specific pattern # LOCATION TIME REFERENCE is found.

The f { ... } block is executed if f is true. So the block is skipped until f is set to a nonzero value.
The /^# .../ { ... } block is executed if the input line matches the pattern. If found, skip is set to the number of header lines and f (flag) is set to 1 so the upper block is executed from the next iteration.

Hope this helps.