Generating a sorted list of three to five letter words

This is the assignment:

Write a script that will generate a single sorted list of three- to five-letter words. Input text will be generated from the on-line ls manual pages (output from ’man ls’ command).

This is my code so far:

man ls | sed '!s/ //g' | tr 'A-Z' 'a-z' | tr -s '\040' '\012' | sort | uniq -u

Here is where I get stuck. We are given steps to reach the desired result, but I am having trouble figuring out the proper grep command. These are the directions:

Using a single grep command, extract the 3 to 5-letter words. Keep in mind that each “word” is now on its own line. You will need to use a regular expression that specifies the whole line (not just a pattern found somewhere in the line). We know that the asterisk represents “zero or more of the previous pattern.” What regular expression is used to represent “from three to five instances of the previous pattern on a line by itself?”. [ Whole line match? You have regular expression “anchor points” that specify the beginning and end of the line. Use them! ]

I think it should look something like this, but it doesn't work.

grep '{3,5}'

EXTRA INFORMATION

1. Filter out all characters except spaces and alpha characters (A-Za-z). You can do this by using the stream editor (sed) to remove (substitute with nothing) all characters not in that set. Hint: How do you specify the regular expression to match a single character that is not an alpha or space character?

2. To avoid duplicates, convert all the letters to the same case. The translate command (tr) should be used to do this (see page 83 of the textbook). For example, ‘The’ and ‘the’ need to be treated as the same word. By making all the text the same case (either upper or lower), you will avoid listing the same mixed-case word more than once.

3. Modify the remaining text such that each “word” is placed on its own line. Use the tr command to convert all spaces to newlines. Every single “word” is now on a line by itself. Don’t worry about the empty lines. They’ll get filtered out later.

4. Use the sort command to sort the lines (“words”). Is there an option we can use with sort to remove duplicate lines? Use that option.

5. Using a single grep command, extract the 3 to 5-letter words. Keep in mind that each “word” is now on its own line. You will need to use a regular expression that specifies the whole line (not just a pattern found somewhere in the line). We know that the asterisk represents “zero or more of the previous pattern.” What regular expression is used to represent “from three to five instances of the previous pattern on a line by itself?”. [ Whole line match? You have regular expression “anchor points” that specify the beginning and end of the line. Use them! ]

Solution

1) Filter out all characters except spaces and alpha characters (A-Za-z). You can do this by using the stream editor (sed) to remove (substitute with nothing) all characters not in that set. Hint: How do you specify the regular expression to match a single character that is not an alpha or space character?

What your teacher probably expects:

sed 's/[^A-Za-z ]//g'

The right way:

sed -r 's/[^[:alpha:][:space:]]+//g'
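To see what this step does on a concrete line, here is a small sketch. The function name strip_non_alpha and the sample text are made up for illustration, and -E is used in place of -r on the assumption that your sed accepts it (current GNU and BSD versions do).

```shell
# Hypothetical helper: delete every run of characters that is
# neither a letter nor whitespace.
strip_non_alpha() {
    sed -E 's/[^[:alpha:][:space:]]+//g'
}

# Made-up sample line; punctuation and dashes disappear, spaces stay.
printf '%s\n' "--all, don't stop." | strip_non_alpha
# prints: all dont stop
```

Note that the characters are deleted rather than replaced, so "don't" becomes "dont" instead of splitting into two words.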

2) To avoid duplicates, convert all the letters to the same case. The translate command (tr) should be used to do this (see page 83 of the textbook). For example, ‘The’ and ‘the’ need to be treated as the same word. By making all the text the same case (either upper or lower), you will avoid listing the same mixed-case word more than once.

Your teacher:

tr 'A-Z' 'a-z'

The right way:

tr '[:upper:]' '[:lower:]'

3) Modify the remaining text such that each “word” is placed on its own line. Use the tr command to convert all spaces to newlines. Every single “word” is now on a line by itself. Don’t worry about the empty lines. They’ll get filtered out later.

Your teacher:

tr ' ' '\n'

Better:

tr '[:blank:]' '\n'

4) Use the sort command to sort the lines (“words”). Is there an option we can use with sort to remove duplicate lines? Use that option.

sort -u
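This is worth comparing with the uniq -u in the question's pipeline, because the two are not equivalent: uniq -u prints only lines that were never repeated, so any word occurring more than once disappears entirely. A quick sketch with a made-up word list (the helper name words is hypothetical):

```shell
# Made-up word list containing one duplicate ("the").
words() { printf '%s\n' the list the file; }

# sort -u keeps one copy of every distinct line.
words | sort -u          # file, list, the

# uniq -u keeps only lines that occurred exactly once,
# so "the" is dropped altogether.
words | sort | uniq -u   # file, list
```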

5) Using a single grep command, extract the 3 to 5-letter words. Keep in mind that each “word” is now on its own line. You will need to use a regular expression that specifies the whole line (not just a pattern found somewhere in the line). We know that the asterisk represents “zero or more of the previous pattern.” What regular expression is used to represent “from three to five instances of the previous pattern on a line by itself?”. [ Whole line match? You have regular expression “anchor points” that specify the beginning and end of the line. Use them! ]

Teacher:

grep -E '^[a-z]{3,5}$'

Better:

grep -E '^[[:alpha:]]{3,5}$'
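As for why grep '{3,5}' from the question does not work: without -E, an unescaped { is an ordinary character in a basic regular expression, and in any case there is nothing before the braces for the interval to repeat. The anchors matter too, as this sketch on a made-up word list shows (the helper name sample is hypothetical):

```shell
# Made-up sample: one "word" per line, including an empty line,
# as the earlier steps would produce.
sample() { printf '%s\n' "" a to the list files format; }

# Unanchored: matches any line that CONTAINS 3 to 5 letters in a row,
# so the six-letter "format" slips through.
sample | grep -E '[[:alpha:]]{3,5}'     # the, list, files, format

# Anchored: the whole line must consist of 3 to 5 letters.
sample | grep -E '^[[:alpha:]]{3,5}$'   # the, list, files
```

The empty line and the one- and two-letter words are filtered out in both cases; only the anchored pattern also rejects longer words. grep -x would force a whole-line match as well, but the assignment asks for the anchors.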

Now figure out which of the commands above your course notes actually support, understand the differences between them, and glue them together with pipes. Good luck!
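If you get stuck on the gluing, one possible combination is sketched below. The wrapper name short_words is made up, the POSIX character-class variants are chosen on the assumption that your tools support them, and the sample sentence stands in for the man page so the result is easy to check by hand.

```shell
# Hypothetical wrapper: reads text on stdin and prints the sorted,
# de-duplicated list of 3 to 5 letter words.
short_words() {
    sed -E 's/[^[:alpha:][:space:]]+//g' |   # 1) drop non-letters
    tr '[:upper:]' '[:lower:]' |             # 2) fold to lower case
    tr '[:blank:]' '\n' |                    # 3) one word per line
    sort -u |                                # 4) sort, drop duplicates
    grep -E '^[[:alpha:]]{3,5}$'             # 5) keep 3-5 letter lines
}

# Made-up sample input; for the assignment use: man ls | short_words
printf '%s\n' "The LS command lists files; the -l option is longer." | short_words
# prints: files, lists, the (one per line)
```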

BTW, here's how you'd do it with one command instead of several in a pipe. This version uses GNU awk for the sorted array traversal; with other awks, drop the PROCINFO line and pipe the output to sort:

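$ man ls | awk '
        {
            # replace anything that is not a letter or whitespace with a space
            gsub(/[^[:alpha:][:space:]]+/," ")
            # fold case; assigning to $0 also re-splits the fields
            $0=tolower($0)
            for (i=1;i<=NF;i++)
               # anchored so the whole word must be 3 to 5 characters
               if ($i ~ /^.{3,5}$/)
                   words[$i]
        }
        END {
            # GNU awk only: traverse the array in ascending string order
            PROCINFO["sorted_in"]="@ind_str_asc"
            for (word in words)
                print word
        }'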