Tags: sorting, text, duplicates, text-processing, uniq

Why isn't sort -u or uniq removing duplicates in concatenated text files?


I am trying to write a bash script that takes three user dictionaries from various places across my boxen, combines them, removes duplicates, and then writes the result back to their respective locations.

However, when I cat the files and run either sort -u or uniq on the result, the duplicate lines remain:

Alastair
Alastair
Albanese
Albanese
Alberts
Alberts
Alec
Alec
Alex
Alex

I narrowed it down to one of the files, which comes from Microsoft Outlook/Windows and is called CUSTOM.DIC. Examining it with file -i showed that it was a UTF-16LE file (it was printing CJK characters when concatenated directly with the UTF-8 files), so I ran the command

iconv -f utf-16le -t utf-8 CUSTOM.DIC -o CUSTOMUTF8.DIC

Yet, when I concatenate that file with my other UTF-8 files, it produces duplicates that cannot be removed using sort -u or uniq.

I have found that for large files, file -i only guesses the file format from the first (many) thousand lines, so I ran the commands

file_to_check="CUSTOMUTF8.DIC"
bytes_to_scan=$(wc -c < "$file_to_check")
file -b --mime-encoding -P bytes="$bytes_to_scan" "$file_to_check"

with the output:

[Image: output of the file type check]

So the conversion has happened, and the output file combined.txt is also UTF-8. Why, then, can't I remove the duplicate lines?

[Image: combined output file showing duplicates]

I have also checked to see if there are any trailing spaces in the combined file.

This feels like a problem that many people would have seen before, but I can't find the answer (or I've created the wrong search string, of course)...


Solution

  • Many thanks to @Andrew Henle - I knew it would be something simple!

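    Indeed, using hexdump -c combined2.txt I saw that some lines ended with \n and some with \r\n, so lines that looked identical on screen were different byte sequences as far as sort and uniq were concerned.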

    So I downloaded dos2unix and ran

    dos2unix combined2.txt
    sort -u combined2.txt > combined3.txt
    

    and it's all good!

    Thanks again, Andrew!
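
For anyone who wants to reproduce the effect or cannot install dos2unix, here is a minimal sketch of the same fix using only tr and sort. The function name merge_unique and the throwaway file names are hypothetical, made up for this illustration. It assumes every carriage return in the input is an unwanted line-ending byte, since tr -d '\r' removes all of them, not just those at the end of a line.

```shell
# Hypothetical helper: concatenate files, strip carriage returns, drop duplicates.
# LC_ALL=C makes sort compare raw bytes, so the result does not depend on locale.
merge_unique() {
    cat -- "$@" | tr -d '\r' | LC_ALL=C sort -u
}

# Demo with throwaway files: one with LF endings, one with CRLF endings.
tmpdir=$(mktemp -d)
printf 'Alec\nAlex\n'     > "$tmpdir/unix.dic"
printf 'Alec\r\nAlex\r\n' > "$tmpdir/dos.dic"

# Count the lines that contain a carriage return (2 for the CRLF file).
cr=$(printf '\r')
grep -c "$cr" "$tmpdir/dos.dic"

# Without normalisation "Alec\r" differs from "Alec", so 4 lines survive.
LC_ALL=C sort -u "$tmpdir/unix.dic" "$tmpdir/dos.dic" | wc -l

# With normalisation only the two distinct names remain.
merge_unique "$tmpdir/unix.dic" "$tmpdir/dos.dic"

rm -rf "$tmpdir"
```

In the setup from the question, this would be called along the lines of merge_unique CUSTOMUTF8.DIC plus the other dictionaries, redirected to a new file and not to one of its own inputs.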