In a text file, I have a sequence of numbers in a column preceded by a short string. It is the 5th column in the example file here under "NAME":
SESSION NAME: session
SAMPLE RATE: 48000.000000
BIT DEPTH: 16-bit
SESSION START TIMECODE: 00:00:00:00.00
TIMECODE FORMAT: 24 Frame
# OF AUDIO TRACKS: 2
# OF AUDIO CLIPS: 2
# OF AUDIO FILES: 2
M A R K E R S L I S T I N G
# LOCATION TIME REFERENCE UNITS NAME COMMENTS
2 0:00.500 24000 Samples xxxx0001
3 0:03.541 170000 Samples xxxx0002
4 0:05.863 281458 Samples xxxx0003
5 0:08.925 428430 Samples xxxx0004
6 0:10.604 509025 Samples xxxx0005
7 0:13.973 670742 Samples xxxx0006
8 0:15.592 748453 Samples xxxx0008
9 0:19.243 923666 Samples xxxx0008
In the example above, 0007 is missing, and 0008 is duplicated.
Therefore, I would like to be able to check if the numbers are:
- sequential given the range that presently exists in the column.
- if there are any duplicates
I would also like to output these results:
SKIPPED:
xxxx0007
DUPLICATES:
xxxx0008
The furthest I have been able to get is to use awk
to get the column I need:
cat <file.txt> | awk '{ print $5 }'
which gets me to this:
NAME
xxxx0001
xxxx0002
xxxx0003
xxxx0004
xxxx0005
xxxx0006
xxxx0008
xxxx0008
But I do not know where to go from here.
Do I need to loop through the list items and parse so I get the number only, then start doing some comparisons to the next line?
Any help would be tremendously appreciated Thank you!
As a starting point, please try the following:
awk '
NR>1 { gsub("[^0-9]", "", $5); count[$5]++ }
END {
print "Skipped:"
for (i=1; i<NR; i++)
if (count[i] == 0) printf "xxxx%04d\n", i
print "Duplicates:"
for (i=1; i<NR; i++)
if (count[i] > 1) printf "xxxx%04d\n", i
} ' file.txt
Output:
Skipped:
xxxx0007
Duplicates:
xxxx0008
NR>1
is used to skip the top header line.gsub("[^0-9]", "", $5)
removes non-number characters from $5
.
As a result, $5
is set to a number extracted from the 5th column.count[]
counts the occurances of each number. If the value
is 0
(or undefined), it means the number is skipped. If the value
is larger than 1
, the number is duplicated.END { ... }
block is executed after all the input lines are processed
and it is useful to report the final results.However, the "Skipped/Duplicates" approach cannot well detect such cases as:
# LOCATION TIME REFERENCE UNITS NAME COMMENTS
1 0:00.500 24000 Samples xxxx0001
2 0:02.888 138652 Samples xxxx0003
3 0:04.759 228446 Samples xxxx0004
4 0:07.050 338446 Samples xxxx0005
5 0:09.034 433672 Samples xxxx0006
6 0:12.061 578958 Samples xxxx0007
7 0:14.111 677333 Samples xxxx0008
8 0:17.253 828181 Samples xxxx0009
or
# LOCATION TIME REFERENCE UNITS NAME COMMENTS
1 0:00.500 24000 Samples xxxx0001
2 0:02.888 138652 Samples xxxx0003
3 0:04.759 228446 Samples xxxx0002
4 0:07.050 338446 Samples xxxx0004
5 0:09.034 433672 Samples xxxx0005
6 0:12.061 578958 Samples xxxx0006
7 0:14.111 677333 Samples xxxx0007
8 0:17.253 828181 Samples xxxx0008
It will be better to perform a line-by-line comparison between expected value and the actual value. Then how about:
awk '
NR>1 {
gsub("[^0-9]", "", $5)
if ($5 != NR-1) printf "Line: %d Expected: xxxx%04d Actual: xxxx%04d\n", NR, NR-1, $5
} ' file.txt
output for the original example:
Line: 8 Expected: xxxx0007 Actual: xxxx0008
[EDIT]
According to the revised input file which includes more extra header lines, how about:
awk '
f {
gsub("[^0-9]", "", $5)
if ($5 != NR-skip) printf "Line: %d Expected: xxxx%04d Actual: xxxx%04d\n", NR, NR-skip, $5
}
/^#[[:blank:]]+LOCATION[[:blank:]]+TIME REFERENCE/ {
skip = NR
f = 1
}
' file.txt
Output:
Line: 19 Expected: xxxx0007 Actual: xxxx0008
The script above skips the lines until the specific pattern # LOCATION TIME REFERENCE
is found.
f { ... }
block is executed if f
is true. So the block is skipped
until f
is set to a nonzero value./^# .../ { ... }
block is executed if the input line matches the
pattern. If found, skip
is set to the number of header lines and
f
(flag) is set to 1 so the upper block is executed from the next
iteration. Hope this helps.