I have a file that is the output of a compute-intensive process that has experienced some kind of error that creates a large number of duplicate lines. However, some of the duplicates are correct and required to correctly parse the output. I can tell the two apart because these duplicate lines are correct only if they are consecutive and begin with a left curly brace. The order of the file is also important to maintain in order for it to be parsed correctly. Using bash, awk, or other command-line scripting, how do I delete only non-consecutive duplicate lines while retaining the original line order?
Sample "good" duplicates:
...
[0.799609,0.8016), indices: 254317-53689123
{[489,713] (0.799609), [67,489] (0.778011), [686,713] (0.762261), [67,686] (0.735254)}
{[489,713] (0.799609), [67,489] (0.778011), [686,713] (0.762261), [67,686] (0.735254)}
...
Sample "bad" duplicates:
value range: [0.665199,0.934318]
distance matrix with 1003 points, using threshold at enclosing radius 0.882441
value range: [0.665199,0.934318]
distance matrix with 1003 points, using threshold at enclosing radius 0.882441
persistent homology intervals in dim 0:
persistent homology intervals in dim 0:
[0, ): {[879]}
[0, ): {[879]}
I have tried this solution and seen this one, but this of course deletes all duplicate lines and does not preserve duplications of the type that I need to retain. I also have seen a solution that deletes only consecutive lines, but none that do the inverse.
Sample input:
value range: [0.665199,0.934318]
distance matrix with 1003 points, using threshold at enclosing radius 0.882441
value range: [0.665199,0.934318]
distance matrix with 1003 points, using threshold at enclosing radius 0.882441
persistent homology intervals in dim 0:
persistent homology intervals in dim 0:
[0, ): {[879]}
[0, ): {[879]}
persistent homology intervals in dim 1:
[0.69551,0.75602), indices: 114992-47779123
{[680,805] (0.69551), [65,329] (0.702312), [688,711] (0.713381)}
persistent homology intervals in dim 1:
[0.69551,0.75602), indices: 114992-47779123
{[680,805] (0.69551), [65,329] (0.702312), [688,711] (0.713381)}
{[680,805] (0.69551), [65,329] (0.702312), [688,711] (0.713381), [31,220] (0.729510)}
[0.799609,0.8016), indices: 254317-53689123
{[489,713] (0.799609), [67,489] (0.778011), [686,713] (0.762261), [67,686] (0.735254)}
{[489,713] (0.799609), [67,489] (0.778011), [686,713] (0.762261), [67,686] (0.735254)}
Sample output:
value range: [0.665199,0.934318]
distance matrix with 1003 points, using threshold at enclosing radius 0.882441
persistent homology intervals in dim 0:
[0, ): {[879]}
persistent homology intervals in dim 1:
[0.69551,0.75602), indices: 114992-47779123
{[680,805] (0.69551), [65,329] (0.702312), [688,711] (0.713381)}
{[680,805] (0.69551), [65,329] (0.702312), [688,711] (0.713381), [31,220] (0.729510)}
[0.799609,0.8016), indices: 254317-53689123
{[489,713] (0.799609), [67,489] (0.778011), [686,713] (0.762261), [67,686] (0.735254)}
{[489,713] (0.799609), [67,489] (0.778011), [686,713] (0.762261), [67,686] (0.735254)}
If a line is encountered both in isolation and as part of a consecutive pair of the correct form, then I wish to keep the ones that show up as a pair. However, I don't think it's known which one will come first in every case.
Thank you so much for your help!
If I'm understanding the requirement correctly, the logic will be:
- Lines starting with { should be specially treated:
  - If consecutive lines start with {, the lines are treated as good, overriding other conditions.
  - Otherwise, if a line starting with { has duplicates either backward or forward, the line should be dropped.
- Lines not starting with { can be handled with the common logic to drop duplicated lines.

Then two-pass processing will work:
awk '
    NR == FNR {                               # pass 1: collect information
        if (/^\{/) {                          # line starts with "{"
            seen1[$0]++                       # count occurrences of this line
            if (prev ~ /^\{/) {               # previous line also starts with "{"
                good[FNR - 1]++; good[FNR]++  # treat consecutive lines as "good"
            }
        }
        prev = $0                             # remember the current line
        next                                  # skip the pass-2 rule
    }
    # pass 2: print "{" lines that are good or occur only once,
    # and the first occurrence of every other line
    (/^\{/ && (good[FNR] || seen1[$0] == 1)) || (!/^\{/ && !seen2[$0]++)
' file file
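To check the two-pass idea against the sample from the question, here is a minimal sketch that wraps the awk program in a shell function and compares its result with the expected sample output. The function name dedup_keep_runs is hypothetical, and the sketch assumes that a "{" line which occurs only once in the whole file should be kept; adjust that condition if isolated "{" lines are always errors in your data.

```shell
#!/bin/sh
# dedup_keep_runs is a hypothetical helper name used only for this sketch.
dedup_keep_runs() {
    awk '
        NR == FNR {                          # pass 1: gather facts about "{" lines
            if (/^\{/) {
                count[$0]++                  # how often this exact line occurs
                if (prev ~ /^\{/) { good[FNR - 1] = 1; good[FNR] = 1 }
            }
            prev = $0
            next
        }
        /^\{/ {                              # pass 2: "{" lines
            # assumption: a "{" line that occurs only once is kept
            if (good[FNR] || count[$0] == 1) print
            next
        }
        !seen[$0]++                          # pass 2: first occurrence of other lines
    ' "$1" "$1"
}

input=$(mktemp)
cat > "$input" <<'EOF'
value range: [0.665199,0.934318]
distance matrix with 1003 points, using threshold at enclosing radius 0.882441
value range: [0.665199,0.934318]
distance matrix with 1003 points, using threshold at enclosing radius 0.882441
persistent homology intervals in dim 0:
persistent homology intervals in dim 0:
[0, ): {[879]}
[0, ): {[879]}
persistent homology intervals in dim 1:
[0.69551,0.75602), indices: 114992-47779123
{[680,805] (0.69551), [65,329] (0.702312), [688,711] (0.713381)}
persistent homology intervals in dim 1:
[0.69551,0.75602), indices: 114992-47779123
{[680,805] (0.69551), [65,329] (0.702312), [688,711] (0.713381)}
{[680,805] (0.69551), [65,329] (0.702312), [688,711] (0.713381), [31,220] (0.729510)}
[0.799609,0.8016), indices: 254317-53689123
{[489,713] (0.799609), [67,489] (0.778011), [686,713] (0.762261), [67,686] (0.735254)}
{[489,713] (0.799609), [67,489] (0.778011), [686,713] (0.762261), [67,686] (0.735254)}
EOF

expected=$(cat <<'EOF'
value range: [0.665199,0.934318]
distance matrix with 1003 points, using threshold at enclosing radius 0.882441
persistent homology intervals in dim 0:
[0, ): {[879]}
persistent homology intervals in dim 1:
[0.69551,0.75602), indices: 114992-47779123
{[680,805] (0.69551), [65,329] (0.702312), [688,711] (0.713381)}
{[680,805] (0.69551), [65,329] (0.702312), [688,711] (0.713381), [31,220] (0.729510)}
[0.799609,0.8016), indices: 254317-53689123
{[489,713] (0.799609), [67,489] (0.778011), [686,713] (0.762261), [67,686] (0.735254)}
{[489,713] (0.799609), [67,489] (0.778011), [686,713] (0.762261), [67,686] (0.735254)}
EOF
)

actual=$(dedup_keep_runs "$input")
rm -f "$input"

if [ "$actual" = "$expected" ]; then
    echo "output matches the expected sample"
else
    echo "output differs from the expected sample"
fi
```

Because the file is read twice, the order of the isolated copy and the consecutive pair does not matter: pass 1 has already marked the paired lines by line number before pass 2 decides what to print. The cost is that the input must be a regular file rather than a pipe.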