linux unix cygwin

Uniq but only on part of the string


I have strings such as:

import a.b.c.d.f.Class1
import a.b.g.d.f.Class2
import a.b.h.d.f.Class3
import z.y.x.d.f.Class4
import z.y.x.d.f.Class5
import z.y.x.d.f.Class6

I want all unique occurrences of the first part of each string, specifically everything up to the third period. So I run:

grep "import curam" -hr --include \*.java | sort | gawk -F "." '{print $1"."$2"."$3}' | uniq

which gives me:

  import a.b.c
  import a.b.g
  import a.b.h
  import z.y.x

However, I'd like to get the full string of the first occurrence for each unique prefix (the part up to the third period). So I want to get:

import a.b.c.d.f.Class1
import a.b.g.d.f.Class2
import a.b.h.d.f.Class3
import z.y.x.d.f.Class4

Any ideas?


Solution

  • Just keep track of the unique 2nd field:

    awk -F '[ .]' '!uniq[$2]++' file
    

    That is, start by setting the field separators to either a space or a dot. This way, the second field is always the first word in the dot-separated name:

    $ awk -F '[ .]' '{print $2}' file
    a
    a
    a
    z
    z
    z
    

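    Then, just check when each value appears for the first time. The expression !uniq[$2]++ is true only the first time a given $2 is seen (the counter is still 0 at that point), and awk prints the line whenever the expression is true: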

    $ awk -F '[ .]' '!uniq[$2]++' file
    import a.b.c.d.f.Class1
    import z.y.x.d.f.Class4
    

    Update, following this comment from the asker: "There are some subtle variations in the first three tokens between the strings, so I need to split on just [.]; I can't split on spaces. I updated the question."

    So if you have:

    import a.b.c.d.f.Class1
    import a.b.g.d.f.Class2
    import a.b.h.d.f.Class3
    import z.y.x.d.f.Class4
    import z.y.x.d.f.Class5
    import z.y.x.d.f.Class6
    

    Then you need to split the second whitespace-separated field on dots and check when its first three parts repeat. This uses the same approach as above, except that split() breaks the field into the array a and the first three elements form the uniqueness key:

    $ awk '{split($2, a, ".")} !uniq[a[1] a[2] a[3]]++' file
    import a.b.c.d.f.Class1
    import a.b.g.d.f.Class2
    import a.b.h.d.f.Class3
    import z.y.x.d.f.Class4
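
A possible variation on the last command, offered as a sketch rather than as part of the original answer: it splits on dots only (as the asker requested) and separates the key parts with commas, so awk joins them with SUBSEP instead of concatenating them directly. With plain concatenation, prefixes such as ab.c.d and a.bc.d would produce the same key. The function name first_per_prefix is hypothetical, chosen only for illustration.

```shell
# first_per_prefix: hypothetical helper; prints the first line seen for each
# distinct prefix, where the prefix is everything up to the third dot.
first_per_prefix() {
    # -F. splits on literal dots only, so for "import a.b.c.d.f.Class1"
    # $1 is "import a", $2 is "b" and $3 is "c".
    # seen[$1, $2, $3] joins the three parts with SUBSEP, which keeps
    # "ab" "c" and "a" "bc" apart (plain concatenation would merge them).
    awk -F. '!seen[$1, $2, $3]++' "$@"
}

# Sample input taken from the question.
printf '%s\n' \
    'import a.b.c.d.f.Class1' \
    'import a.b.g.d.f.Class2' \
    'import a.b.h.d.f.Class3' \
    'import z.y.x.d.f.Class4' \
    'import z.y.x.d.f.Class5' \
    'import z.y.x.d.f.Class6' | first_per_prefix
# Prints the Class1, Class2, Class3 and Class4 lines.
```

Because this approach does not rely on duplicates being adjacent, it should be able to replace the sort | gawk | uniq stages of the pipeline in the question; the output then keeps the order in which grep found the lines rather than sorted order.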