
bash - delete only non-consecutive duplicate lines without changing file order


I have a file that is the output of a compute-intensive process. The process hit some kind of error that produced a large number of duplicate lines. However, some of the duplicates are correct and are required to parse the output correctly. I can tell the two apart: duplicate lines are correct only if they are consecutive and begin with a left curly brace. The order of the file must also be preserved for it to be parsed correctly. Using bash, awk, or other command-line scripting, how do I delete only non-consecutive duplicate lines while retaining the original line order?

Sample "good" duplicates:

...
 [0.799609,0.8016), indices: 254317-53689123
{[489,713] (0.799609), [67,489] (0.778011), [686,713] (0.762261), [67,686] (0.735254)}
{[489,713] (0.799609), [67,489] (0.778011), [686,713] (0.762261), [67,686] (0.735254)}
...

Sample "bad" duplicates:

value range: [0.665199,0.934318]
distance matrix with 1003 points, using threshold at enclosing radius 0.882441
value range: [0.665199,0.934318]
distance matrix with 1003 points, using threshold at enclosing radius 0.882441
persistent homology intervals in dim 0:
persistent homology intervals in dim 0:
 [0, ):  {[879]}
 [0, ):  {[879]}

I have tried one existing solution and seen another, but these of course delete all duplicate lines and do not preserve the duplicates that I need to retain. I have also seen a solution that deletes only consecutive duplicate lines, but none that does the inverse.


Sample input:

value range: [0.665199,0.934318]
distance matrix with 1003 points, using threshold at enclosing radius 0.882441
value range: [0.665199,0.934318]
distance matrix with 1003 points, using threshold at enclosing radius 0.882441
persistent homology intervals in dim 0:
persistent homology intervals in dim 0:
 [0, ):  {[879]}
 [0, ):  {[879]}
persistent homology intervals in dim 1:
 [0.69551,0.75602), indices: 114992-47779123
{[680,805] (0.69551), [65,329] (0.702312), [688,711] (0.713381)}
persistent homology intervals in dim 1:
 [0.69551,0.75602), indices: 114992-47779123
{[680,805] (0.69551), [65,329] (0.702312), [688,711] (0.713381)}
{[680,805] (0.69551), [65,329] (0.702312), [688,711] (0.713381), [31,220] (0.729510)}
 [0.799609,0.8016), indices: 254317-53689123
{[489,713] (0.799609), [67,489] (0.778011), [686,713] (0.762261), [67,686] (0.735254)}
{[489,713] (0.799609), [67,489] (0.778011), [686,713] (0.762261), [67,686] (0.735254)}

Sample output:

value range: [0.665199,0.934318]
distance matrix with 1003 points, using threshold at enclosing radius 0.882441
persistent homology intervals in dim 0:
 [0, ):  {[879]}
persistent homology intervals in dim 1:
 [0.69551,0.75602), indices: 114992-47779123
{[680,805] (0.69551), [65,329] (0.702312), [688,711] (0.713381)}
{[680,805] (0.69551), [65,329] (0.702312), [688,711] (0.713381), [31,220] (0.729510)}
 [0.799609,0.8016), indices: 254317-53689123
{[489,713] (0.799609), [67,489] (0.778011), [686,713] (0.762261), [67,686] (0.735254)}
{[489,713] (0.799609), [67,489] (0.778011), [686,713] (0.762261), [67,686] (0.735254)}

If a line is encountered both in isolation and as part of a consecutive pair of the correct form, then I wish to keep the one that shows up as a pair. However, I don't think it is known which of the two will come first in every case.

Thank you so much for your help!


Solution

  • If I'm understanding the requirement correctly, the logic will be:

    • The lines starting with { should be specially treated:
      • If two (or more?) consecutive lines start with {, they are treated as good, overriding the other conditions.
      • If a single (non-consecutive) line starting with { has a duplicate either earlier or later in the file, the line should be dropped.
    • The lines not starting with { can be handled with the common logic to drop duplicated lines.

    Then two-pass processing will work:

    awk '
        NR==FNR {                                                           # pass-1
            if (/^\{/) {                                                    # starting with "{"
                seen1[$0]++                                                 # mark it
                if (prev ~ /^\{/) {good[FNR - 1]++; good[FNR]++}            # treat consecutive lines as "good"
            }
            prev = $0                                                       # remember current line
            next                                                            # skip pass-2
        }
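        # pass-2: print "{" lines that are in a consecutive run or occur only once
        # (pass-1 counts every "{" line, so the test is seen1 == 1, not !seen1),
        # and print only the first occurrence of every other line
        (/^\{/ && (good[FNR] || seen1[$0] == 1)) || (!/^\{/ && !seen2[$0]++)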
    ' file file
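
To check the logic on something small, here is a minimal sketch that wraps the same two-pass idea in a shell function and runs it on a made-up eight-line file. The function name `dedupe_nonconsecutive`, the variable names, and the sample lines are hypothetical and only for illustration. It also assumes that a `{` line occurring exactly once in the whole file should be kept, which is why the last condition tests `seen1[$0] == 1`.

```shell
#!/usr/bin/env bash

# Hypothetical wrapper: $1 is a regular file, which awk reads twice.
dedupe_nonconsecutive() {
    awk '
        NR==FNR {                                    # pass-1: collect information
            if (/^\{/) {
                seen1[$0]++                          # count every "{" line
                if (prev ~ /^\{/) {good[FNR - 1]++; good[FNR]++}   # consecutive "{" lines are good
            }
            prev = $0
            next
        }
        # pass-2: keep good or unique "{" lines, and first occurrences of all other lines
        (/^\{/ && (good[FNR] || seen1[$0] == 1)) || (!/^\{/ && !seen2[$0]++)
    ' "$1" "$1"
}

# Made-up sample: an isolated "{x}" (bad), a consecutive run (good),
# and a "{z}" that occurs only once (kept under the assumption above).
sample=$(mktemp)
cat > "$sample" <<'EOF'
header A
{x}
header A
{x}
{x}
{y}
header B
{z}
EOF

result=$(dedupe_nonconsecutive "$sample")
rm -f "$sample"
printf '%s\n' "$result"
```

On this sample the script prints `header A`, `{x}`, `{x}`, `{y}`, `header B`, `{z}`, one per line: the isolated `{x}` on line 2 and the repeated `header A` are dropped, while the consecutive run is kept intact. Because awk reads the input twice, the argument has to be a real file rather than a pipe or process substitution.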