
How to generate a list of (unique) words from a text file in Ubuntu?


I have an ASCII text file. I want to generate a list of all "words" from that file using one or more Ubuntu commands. A word is defined as an alphanumeric sequence between delimiters. Delimiters are whitespace by default, but I also want to experiment with other characters such as punctuation. In other words, I want to be able to specify a delimiter character set. How do I produce only a unique set of words? What if I also want to list only those words that are at least N characters long?


Solution

  • You could use grep:

    -E '\w+' searches for words (\w matches letters, digits, and the underscore)

    -o only prints the portion of the line that matches

    % cat temp
    Some examples use "The quick brown fox jumped over the lazy dog," rather than "Lorem ipsum dolor sit amet, consectetur adipiscing elit" for example text.

    If you don't care whether words repeat:

    % grep -o -E '\w+' temp
    Some
    examples
    use
    The
    quick
    brown
    fox
    jumped
    over
    the
    lazy
    dog
    rather
    than
    Lorem
    ipsum
    dolor
    sit
    amet
    consectetur
    adipiscing
    elit
    for
    example
    text
    

    If you want to print each word only once, disregarding case, you can pipe the output through sort:

    -u only prints each word once

    -f tells sort to ignore case when comparing words

    If you only want each word once:

    % grep -o -E '\w+' temp | sort -u -f
    adipiscing
    amet
    brown
    consectetur
    dog
    dolor
    elit
    example
    examples
    for
    fox
    ipsum
    jumped
    lazy
    Lorem
    over
    quick
    rather
    sit
    Some
    text
    than
    The
    use
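
    The question also asks for words of at least N characters. With the grep approach this can be done with an interval in the pattern. A minimal sketch, assuming GNU grep as shipped with Ubuntu; the variable names min_len and text and the sample sentence are illustrative and not part of the original answer:

    ```shell
    # min_len and text are illustrative names; adjust them to your file and N.
    min_len=5
    text='The quick brown fox jumped over the lazy dog'

    # \w{N,} matches runs of at least N word characters, so shorter words
    # produce no match at all; sort -u -f then removes duplicates, ignoring case.
    result=$(printf '%s\n' "$text" | grep -o -E "\w{${min_len},}" | sort -u -f)
    printf '%s\n' "$result"
    ```

    For a file, replace the printf with the file name as grep's last argument, for example grep -o -E '\w{5,}' temp.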
    

    You can also use the tr command:

    echo the quick brown fox jumped over the lazydog | tr -cs 'a-zA-Z0-9' '\n'
    the
    quick
    brown
    fox
    jumped
    over
    the
    lazydog
    

    -c takes the complement of the specified character set, so every character that is not in the set gets replaced. -s squeezes each run of repeated replacement characters into a single one. 'a-zA-Z0-9' is the set of alphanumerics; if you add a character to it, the input is no longer split on that character (see the next example). '\n' is the replacement character (newline).

    echo the quick brown fox jumped over the lazy-dog | tr -cs 'a-zA-Z0-9-' '\n'
    the
    quick
    brown
    fox
    jumped
    over
    the
    lazy-dog
    

    Because we added '-' to the set of non-delimiters, lazy-dog was printed as one word. Otherwise the output is:

    echo the quick brown fox jumped over the lazy-dog | tr -cs 'a-zA-Z0-9' '\n'
    the
    quick
    brown
    fox
    jumped
    over
    the
    lazy
    dog
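
    If you would rather list the delimiters themselves than the characters that make up a word, tr can be run without -c. A hedged sketch, assuming GNU tr as shipped with Ubuntu (which pads the shorter second set with its last character); delims and text are illustrative names and sample data, not from the original answer:

    ```shell
    # delims is the delimiter set; every character in it splits words.
    delims=' ,;.'
    text='one, two;three. two'

    # Each delimiter becomes a newline, -s squeezes runs of newlines into one,
    # and sort -u keeps a single copy of every word.
    result=$(printf '%s\n' "$text" | tr -s "$delims" '\n' | sort -u)
    printf '%s\n' "$result"
    ```

    Some characters are special inside a tr set (a backslash starts an escape, and a hyphen between two characters denotes a range), so a set that needs them may require escaping or reordering.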
    

    Summary for tr: with -c, any character that is not in the specified set acts as a delimiter. I hope this solves your delimiter problem too.
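
    The tr approach can be combined with the other two requirements from the question (unique words, minimum length). This is only a sketch under the assumption that awk and sort are available, as they are on a default Ubuntu install; min_len and text are illustrative names:

    ```shell
    # min_len is the minimum word length to keep; text stands in for the file.
    min_len=4
    text='the quick brown fox jumped over the lazy-dog'

    # 1. tr splits on every non-alphanumeric character.
    # 2. awk keeps only lines (words) with at least min_len characters.
    # 3. sort -u prints each remaining word once.
    result=$(printf '%s\n' "$text" \
        | tr -cs 'a-zA-Z0-9' '\n' \
        | awk -v n="$min_len" 'length($0) >= n' \
        | sort -u)
    printf '%s\n' "$result"
    ```

    To run it on a file instead of a string, feed the file to tr with input redirection, for example tr -cs 'a-zA-Z0-9' '\n' < temp.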