
Even after `sort`, `uniq` is still repeating some values


Reference file: http://snap.stanford.edu/data/wiki-Vote.txt.gz

(It is a gzip-compressed file that decompresses to a file called Wiki-Vote.txt)

The first few lines of the file, as printed by head -n 10 Wiki-Vote.txt, are:

# Directed graph (each unordered pair of nodes is saved once): Wiki-Vote.txt 
# Wikipedia voting on promotion to administratorship (till January 2008). 
# Directed edge A->B means user A voted on B becoming Wikipedia administrator.
# Nodes: 7115 Edges: 103689
# FromNodeId    ToNodeId
     30          1412
     30          3352
     30          5254
     30          5543
     30          7478
     3            28

I want to find the number of nodes in the graph (although it's already given in the header line beginning with # Nodes:). I ran the following command:

awk '!/^#/ { print $1; print $2; }' Wiki-Vote.txt | sort | uniq | wc -l

Explanation:

  • /^#/ matches all the lines that start with #, and !/^#/ matches the lines that don't.

  • awk '!/^#/ { print $1; print $2; }' Wiki-Vote.txt prints the first and second fields of every matching line, each on its own line.

  • | sort pipes that output to sort, which sorts it.

  • | uniq should print only the unique values, but it doesn't.

  • | wc -l counts the resulting lines, and the count is wrong.
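For reference, the stages described above can be tried on a tiny hand-made sample. This is only a sketch: the three edges and the temporary file are made up for illustration and are not taken from Wiki-Vote.txt, and the sample uses plain LF line endings.

```shell
# Hypothetical three-edge sample with plain LF line endings
sample=$(mktemp)
printf '# FromNodeId\tToNodeId\n30\t1412\n30\t3352\n3\t30\n' > "$sample"

# Stage 1: skip comment lines and print one node ID per line
awk '!/^#/ { print $1; print $2 }' "$sample"

# Stages 2-4: sort, drop adjacent duplicates, count the remaining lines
# (tr strips the padding that some wc implementations add)
count=$(awk '!/^#/ { print $1; print $2 }' "$sample" | sort | uniq | wc -l | tr -d ' ')
echo "distinct nodes: $count"

rm -f "$sample"
```

On this sample the distinct IDs are 3, 30, 1412 and 3352, so the count is 4.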

The result of the above command is 8491, which is not 7115 (the node count given in the header). I don't know why uniq repeats values. I can tell that it does because awk '!/^#/ { print $1; print $2; }' Wiki-Vote.txt | sort -i | uniq | tail returns:

992
993
993
994
994
995
996
998
999
999

This output contains repeated values. Could someone please run the command and confirm that I am not the only one getting the wrong answer, and help me figure out why I am getting this result?


Solution

  • The file has DOS line endings: each line ends with CR LF (\r\n), so a CR character precedes every newline.

    You can inspect your tail output with, for example, hexdump -C (the lines starting with # were added by me):

    $ awk '!/^#/ { print $1; print $2; }' ./wiki-Vote.txt | sort | uniq | tail | hexdump -C
    00000000  39 39 32 0a 39 39 33 0a  39 39 33 0d 0a 39 39 34  |992.993.993..994|
    #                                           ^^ HERE
    00000010  0a 39 39 34 0d 0a 39 39  35 0d 0a 39 39 36 0a 39  |.994..995..996.9|
    #                     ^^              ^^ 
    00000020  39 38 0a 39 39 39 0a 39  39 39 0d 0a              |98.999.999..|
    #                                        ^^
    0000002c
    

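    In this output, the values that come from $2 (the last field on each line) carry the trailing CR, while the values from $1 do not. uniq therefore sees two different lines for the same number, one with a CR and one without, and keeps both. Remove the CR characters before piping to sort. Note that sort | uniq can be written more simply as sort -u.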

    $ awk '!/^#/ { print $1; print $2; }' ./wiki-Vote.txt | tr -d '\r' | sort -u | wc -l
    7115
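The effect can also be reproduced without downloading the data set. The sketch below uses made-up sample values (the temporary files and their contents are hypothetical). It first shows that a trailing CR is enough to make uniq keep both copies of a number, then shows one possible variant of the fix that strips the CR inside awk with sub() instead of using a separate tr step.

```shell
# Hypothetical sample: the same IDs with and without a trailing CR
sample=$(mktemp)
printf '993\n993\r\n994\n994\r\n' > "$sample"

# "993" and "993\r" are different lines, so all four lines survive
with_cr=$(sort "$sample" | uniq | wc -l | tr -d ' ')

# After deleting every CR, only the two real values remain
without_cr=$(tr -d '\r' < "$sample" | sort -u | wc -l | tr -d ' ')

# Variant: strip a trailing CR inside awk before the fields are printed
edges=$(mktemp)
printf '1\t2\r\n2\t1\r\n' > "$edges"
in_awk=$(awk '{ sub(/\r$/, ""); print $1; print $2 }' "$edges" | sort -u | wc -l | tr -d ' ')

echo "with CR: $with_cr, CR removed: $without_cr, stripped in awk: $in_awk"
rm -f "$sample" "$edges"
```

Here the first count is 4 and the other two are 2. The tr version is shorter when CRs can simply be deleted everywhere; the awk version only touches a CR at the end of a line.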