Is there a Bash function that allows me to separate/delete/isolate lines from a file when they have the same first word?


I have a text file like this:

id ; lorem ipsum  fgdg df gdg
id ; lorem ipsum  fgdg df gdg
id ; lorem ipsum  fgdg df gdg
id ; lorem ipsum  fgdg df gdg
id ; lorem ipsum  fgdg df gdg

If two or more lines share the same id, I want to separate those lines from the lines whose id is unique.

uniquefile should contain the lines whose id is unique; notuniquefile should contain the lines whose id appears more than once.

I already found a way to almost do it, but only with the first word: it just isolates the id and drops the rest of the line.

Command 1: isolating the unique ids (but losing the rest of each line):

awk -F ";" '{!seen[$1]++};END{for(i in seen) if(seen[i]==1)print i }' originfile >> uniquefile

Command 2: isolating the non-unique ids (again losing the rest of the line, including the "lorem ipsum" content, which can differ from line to line):

awk -F ":" '{!seen[$1]++;!ligne$0};END{for(i in seen) if(seen[i]>1)print i  }' originfile >> notuniquefile

So, in a perfect world, I would like to obtain this kind of result:

originfile:

1 ; toto
2 ; toto
3 ; toto
3 ; titi
4 ; titi

uniquefile:

1 ; toto
2 ; toto
4 ; titi

notuniquefile:

3 ; toto
3 ; titi
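
For reference, Command 1 applied to this originfile prints only the bare ids whose count is 1 (each id keeps the trailing space left by the " ; " separator, and the order of awk's for-in loop is unspecified), which is exactly the "missing the line" problem:

$ awk -F ";" '{seen[$1]++} END{for (i in seen) if (seen[i] == 1) print i}' originfile
1
2
4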

Have a good day.


Solution

  • Yet another method, with just two Unix commands, that works if your id fields always have the same length (let's assume they are one character long, as in my test data, but of course it also works for longer fields):

    # feed testfile.txt, sorted, to uniq
    # -w 1 means: only compare the first character of each line
    # -D means: print all duplicate lines, not just one per group
    sort testfile.txt | uniq -w 1 -D > duplicates.txt
    
    # then filter all duplicate lines out of the file,
    # so that only the unique lines slip through
    # -v means: invert the match (keep the lines that do not match)
    # -F means: use fixed strings instead of regular expressions
    # -f means: load the patterns from a file
    grep -v -F -f duplicates.txt testfile.txt > unique.txt
    

    And the output is (for the same input lines as used in my other post):

    $ sort testfile.txt | uniq -w 1 -D
    2;line B
    2;line C
    3;line D
    3;line E
    3;line F
    

    and:

    $ grep -v -F -f duplicates.txt testfile.txt 
    1;line A
    4;line G
    
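    Note that the -F patterns loaded with -f match anywhere in a line, so if a duplicate line could occur as a substring of a longer unique line, the unique line would wrongly be filtered out too. Adding -x restricts grep to whole-line matches:

    grep -v -x -F -f duplicates.txt testfile.txt > unique.txt
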

    Btw., in case you want to avoid the grep, you can also store the output of the sort (let's say in sorted_file.txt) and replace the grep command with

    uniq -w 1 -u sorted_file.txt > unique.txt
    

    where the number after -w is again the length of your id field in characters.
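
    Putting that variant together (a sketch, with the same assumed filenames as above):

    # sort once and keep the sorted copy
    sort testfile.txt > sorted_file.txt
    # all lines whose id (the first character) occurs more than once
    uniq -w 1 -D sorted_file.txt > duplicates.txt
    # all lines whose id occurs exactly once
    uniq -w 1 -u sorted_file.txt > unique.txt

    And if your id fields do not all have the same length, uniq -w is not an option. In that case a two-pass awk in the spirit of the question's own attempts (a sketch, using the originfile/uniquefile/notuniquefile names from the question) keeps the full lines while routing each one by how often its id occurs:

    # pass 1 (NR == FNR is only true while reading the first file): count each id
    # pass 2: write each whole line to the right output file
    awk -F ";" '
        NR == FNR { seen[$1]++; next }
        seen[$1] == 1 { print > "uniquefile" }
        seen[$1] > 1  { print > "notuniquefile" }
    ' originfile originfile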