
Delete similar lines csh


I've seen several articles on deleting duplicate lines, but I need something a little more specific. Here is an example of some raw data:

11111 AA 1  date1
11111 BB 64 date1
11111 BB 64 date2
...
11111 BB 64 date64
11111 BB 64 date1
11111 BB 64 date2
...
11111 BB 64 date64
11111 BB ## date1
11111 BB ## date2
...
11111 BB ## date##
22222 AA 1  date1
22222 BB 64 date1
22222 BB 64 date2
...
22222 BB 64 date64
22222 BB 64 date1
22222 BB 64 date2
...
22222 BB 64 date64
22222 BB ## date1
22222 BB ## date2
...
22222 BB ## date##

Note: Where ## is some number < 64.

I need to edit that file so it looks something like this:

11111 AA 1  date1
11111 BB 64 date1
11111 BB 64 date1
11111 BB ## date1
22222 AA 1  date1
22222 BB 64 date1
22222 BB 64 date1
22222 BB ## date1

I've seen several examples of using awk, sed, or ed along with regex to match the first part of a line. My confusion is with the occurrence of both "BB 64" and "BB ##" lines: I don't want to simply delete all BB lines but the first.

Vital Info: This csh script runs on Solaris 5.8.

The AA lines are not important in this question except to know they are there (we are not doing anything with them).

Here's essentially what I've got so far (still having syntax issues from looking at examples using other languages, so if you can correct please do):

sed 'N;(\d{1,8}\sBB\s\d{1,2}.+\n);P;D' filename

If I were not getting syntax errors, I am sure this would delete all BB lines but the first "BB 64 date1." I think my sed regex above is based on uniq, but it only matches the first part of the line instead of the entire line, because I need the first date of each BB series. (If there is more than one series of BB 64 for each 11111, 22222, etc., the output should contain an identical BB 64 line for each series, with just date1.) Any ideas?


Solution

  • Seems like sort -k4,4 | uniq would do the trick? (or sort +3 if the Solaris version is sufficiently old.)
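One caveat on the suggestion above: uniq compares whole lines, so lines that differ only in the date field would not collapse, and sorting would also change the original line order. As an alternative, here is a minimal sketch in awk. It rests on an assumption drawn from the sample data, not on anything the asker confirmed: that the third field is the number of lines in its series (1 for AA, 64 or ## for BB). The function name first_of_series and the shortened demo data are made up for illustration.

```shell
# Keep only the first line of every series.
# ASSUMPTION: field 3 is the number of lines in the series.
first_of_series() {
    awk '
        skip > 0 { skip--; next }   # still inside a series: drop this line
        { print; skip = $3 - 1 }    # first line of a series: keep it, then
                                    # skip the remaining ($3 - 1) lines
    ' "$@"
}

# Demo with shortened series (3 and 2 lines instead of 64 and ##).
printf '%s\n' \
    '11111 AA 1 date1' \
    '11111 BB 3 date1' \
    '11111 BB 3 date2' \
    '11111 BB 3 date3' \
    '11111 BB 3 date1' \
    '11111 BB 3 date2' \
    '11111 BB 3 date3' \
    '11111 BB 2 date1' \
    '11111 BB 2 date2' \
    '22222 AA 1 date1' |
    first_of_series
```

The function is written for sh; from a csh script, the awk program could be saved to a file and run with awk -f instead. On Solaris, nawk or /usr/xpg4/bin/awk may be a safer choice than the default /usr/bin/awk. If a series can be shorter than its third field says, this count-based approach would drop the wrong lines, and the series boundary would have to be detected some other way.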