Tags: regex, replace, duplicates, notepad++

Remove duplicate lines containing same starting text


So I have a massive list where every line follows the same format:

#976B4B|B|0|0
#970000|B|0|1
#974B00|B|0|2
#979700|B|0|3
#4B9700|B|0|4
#009700|B|0|5
#00974B|B|0|6
#009797|B|0|7
#004B97|B|0|8
#000097|B|0|9
#4B0097|B|0|10
#970097|B|0|11
#97004B|B|0|12
#970000|B|0|13
#974B00|B|0|14
#979700|B|0|15
#4B9700|B|0|16
#009700|B|0|17
#00974B|B|0|18
#009797|B|0|19
#004B97|B|0|20
#000097|B|0|21
#4B0097|B|0|22
#970097|B|0|23
#97004B|B|0|24
#2C2C2C|B|0|25
#979797|B|0|26
#676767|B|0|27
#97694A|B|0|28
#020202|B|0|29
#6894B4|B|0|30
#976B4B|B|0|31
#808080|B|1|0
#800000|B|1|1
#803F00|B|1|2
#808000|B|1|3

What I am trying to do is remove every line whose hex code has already appeared on an earlier line, regardless of the text after it.

For example, the hex code #976B4B from the first line, #976B4B|B|0|0, shows up again in line 32 as #976B4B|B|0|31. I want all lines EXCEPT the first occurrence to be removed.

I have been attempting to use regex to solve this, and found that replacing ^(.*)(\r?\n\1)+$ with $1 removes consecutive duplicate lines, but that is obviously not what I need. Looking for some guidance and maybe a chance to learn from this.


Solution

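  • You can use the following regex replacement (Search Mode set to Regular expression, with ". matches newline" unchecked). Each pass removes at most one duplicate per match, so click Replace All as many times as necessary, until no match is found: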

    Find What: ^((#[[:xdigit:]]+)\|.*(?:\R.+)*?)\R\2\|.*
    Replace With: $1

    Details:

    • ^ - start of a line
    • ((#[[:xdigit:]]+)\|.*(?:\R.+)*?) - Group 1 ($1, it will be kept):
      • (#[[:xdigit:]]+) - Group 2: # and one or more hex chars
      • \| - a | char
      • .* - the rest of the line
      • (?:\R.+)*? - zero or more non-empty lines, as few as possible (if the lines can be empty, replace .+ with .*)
    • \R\2\|.* - a line break, Group 2 value, | and the rest of the line.
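
If you would rather script this than click Replace All repeatedly, the same idea can be sketched in Python. This is only a rough illustration under a few assumptions: the input uses \n line endings and has no blank lines, and the function names dedupe_with_regex and dedupe_by_key are made up for this example. Python's re module supports neither \R nor [[:xdigit:]], so the pattern below is a translation of the Notepad++ one rather than a copy. The second function is a different technique (a single pass with a set of seen keys) that should give the same result on data shaped like the question's.

```python
import re

# Translation of the Notepad++ pattern for Python's re module:
#   \R           -> \n            (assumes the text uses \n line endings)
#   [[:xdigit:]] -> [0-9A-Fa-f]
DUPLICATE = re.compile(
    r"^((#[0-9A-Fa-f]+)\|.*(?:\n.+)*?)\n\2\|.*",
    re.MULTILINE,
)


def dedupe_with_regex(text: str) -> str:
    """Repeat the replacement until nothing changes (like repeated Replace All)."""
    while True:
        replaced = DUPLICATE.sub(r"\1", text)
        if replaced == text:
            return text
        text = replaced


def dedupe_by_key(text: str) -> str:
    """Single pass: keep a line only if the text before its first | is new."""
    seen = set()
    kept = []
    for line in text.split("\n"):
        key = line.split("|", 1)[0]
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return "\n".join(kept)


# A shortened version of the list from the question.
sample = "\n".join([
    "#976B4B|B|0|0",
    "#970000|B|0|1",
    "#974B00|B|0|2",
    "#970000|B|0|13",
    "#974B00|B|0|14",
    "#976B4B|B|0|31",
    "#808080|B|1|0",
])

print(dedupe_with_regex(sample))
```

On a really large file the single-pass version is probably the more practical of the two, since the regex has to rescan the text on every pass.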