I have a text file like this:
id ; lorem ipsum fgdg df gdg
id ; lorem ipsum fgdg df gdg
id ; lorem ipsum fgdg df gdg
id ; lorem ipsum fgdg df gdg
id ; lorem ipsum fgdg df gdg
If two lines share the same id, I want to separate the lines whose id is duplicated from the lines whose id is unique.
uniquefile contains the lines with a unique id.
notuniquefile contains the lines whose id is not unique.
I already found a way to almost do it, but it only keeps the first word: it isolates the id and drops the rest of the line.
Command 1: isolating the unique ids (but losing the rest of the line):
awk -F ";" '{!seen[$1]++};END{for(i in seen) if(seen[i]==1)print i }' originfile >> uniquefile
Command 2: isolating the non-unique ids (but losing the rest of the line, including the "lorem ipsum" content, which can differ from line to line):
awk -F ":" '{!seen[$1]++;!ligne$0};END{for(i in seen) if(seen[i]>1)print i }' originfile >> notuniquefile
Ideally, I would like to obtain this kind of result:
originfile:
1 ; toto
2 ; toto
3 ; toto
3 ; titi
4 ; titi
uniquefile:
1 ; toto
2 ; toto
4 ; titi
notuniquefile:
3 ; toto
3 ; titi
Have a good day.
Here is yet another method that needs just two command lines. It works if your id fields always have the same length (let's assume they are one character long, as in my test data, but it works for longer fields too):
# feed the sorted testfile.txt to uniq
# -w 1 means: compare only the first character of each line
# -D means: print all duplicate lines (every member of a group, not just one per group)
sort testfile.txt | uniq -w 1 -D > duplicates.txt
# then filter all duplicate lines out of the text file
# so that only the unique lines slip through
# -v means: invert the match (print the lines that do not match)
# -F means: use fixed strings instead of regex
# -f means: load the patterns from a file
grep -v -F -f duplicates.txt testfile.txt > unique.txt
And the output is (for the same input lines as used in my other post):
$ uniq -w 1 -D testfile.txt
2;line B
2;line C
3;line D
3;line E
3;line F
and:
$ grep -v -F -f duplicates.txt testfile.txt
1;line A
4;line G
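To try both steps end to end, here is a minimal sketch. The sample data is an assumption reconstructed from the outputs shown above, and it relies on GNU uniq for -w and -D. The -x flag on grep is an addition that is not in the command above: it requires whole-line matches, so a duplicate line cannot remove a longer line that merely contains it.

```shell
# Work in a scratch directory so no existing file gets overwritten
workdir=$(mktemp -d) && cd "$workdir"

# Hypothetical sample data, reconstructed from the outputs shown above
printf '%s\n' '1;line A' '2;line B' '2;line C' '3;line D' \
              '3;line E' '3;line F' '4;line G' > testfile.txt

# All lines whose first character (the id) occurs more than once
sort testfile.txt | uniq -w 1 -D > duplicates.txt

# Everything else; -x (whole-line match) is an addition for safety
grep -v -x -F -f duplicates.txt testfile.txt > unique.txt

cat duplicates.txt
cat unique.txt
```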
By the way, in case you want to avoid the grep, you can also store the output of the sort (let's say in sorted_file.txt) and replace the second line by
uniq -w 1 -u sorted_file.txt > unique.txt
where the number after -w is again the length of your id field in characters.
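If the ids do not all have the same length, -w no longer fits. One possible alternative, offered as a sketch rather than as part of the method above, is to let awk read the file twice: the first pass counts each id and the second pass routes every full line to one of two files. The file names and the sample data are taken from the question; treating everything before the first ";" as the id is an assumption.

```shell
# Work in a scratch directory so no existing file gets overwritten
workdir=$(mktemp -d) && cd "$workdir"

# Sample data from the question
printf '%s\n' '1 ; toto' '2 ; toto' '3 ; toto' '3 ; titi' '4 ; titi' > originfile

# The file is passed twice on purpose.
# First pass (NR == FNR): count how often each id occurs.
# Second pass: write the whole line to the matching output file.
awk -F ';' '
    NR == FNR { count[$1]++; next }
    { print > (count[$1] == 1 ? "uniquefile" : "notuniquefile") }
' originfile originfile

cat uniquefile
cat notuniquefile
```

Unlike the sort-based pipeline, this keeps the lines in their original order.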