
RegEx: find every string with two or more letters


I'm trying to sort a text by the frequency of certain consonant clusters, using Cygwin.

The command I used first is:

tr 'a-zöäü' 'A-ZÖÄÜ' < text.txt | tr -sc 'BCDFGHJKLMNPQRSTVWXYZ' '\n' | 
sort | uniq -c | sort -nr

What I think it does:

Translate all lowercase letters to uppercase, eliminate everything that doesn't match the first regex, and print a newline after every string.

It gave me a list like this:

300 N
181 R
157 D
116 S
 91 T
 82 G
 81 M
 69 B
 65 ND

That is already pretty nice, BUT I'm only interested in clusters of two or more letters (so the first interesting match for me would be 'ND'). Now I'm trying to eliminate every string with fewer than two letters.

What I tried:

 tr 'a-zöäü' 'A-ZÖÄÜ' < text.txt | tr -sc [BCDFGHJKLMNPQRSTVWXYZ]{2} '\n' | 
 sort | uniq -c | sort -nr

I thought that adding {2} would match any combination of two consonants and shut out the single letters cluttering my list (N, R, D, ...), but it didn't change anything; the list stayed the same.

Can anyone help me out?

Thanks already.


Solution

  • You could post-process with grep:

    ... | grep -E '[[:digit:]]+ [[:alnum:]]{2,}$'
    

    That'll show just the lines that end with two or more alphanumeric characters, along with their preceding counts.
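
As for why adding {2} changed nothing: tr takes sets of characters as its operands, not regular expressions, so the brackets and {2} most likely just end up as extra literal characters in the set. Filtering by length therefore has to happen in a later stage. Here is a rough sketch of the whole pipeline with the grep filter attached. The function name count_clusters and the file sample.txt are made up for this example, and the umlauts are left out on purpose because how tr handles them can depend on the locale and the tr build.

```shell
# count_clusters is a hypothetical helper name, not from the original post.
# It is ASCII-only on purpose: umlaut handling in tr may vary by locale.
# Steps: uppercase the text, turn every run of non-consonants into one
# newline (so each consonant run sits on its own line), count the runs,
# sort by count, then keep only runs of two or more characters.
count_clusters() {
  tr 'a-z' 'A-Z' < "$1" |
    tr -sc 'BCDFGHJKLMNPQRSTVWXYZ' '\n' |
    sort | uniq -c | sort -nr |
    grep -E '[[:digit:]]+ [[:alnum:]]{2,}$'
}

# sample.txt is made-up input standing in for text.txt
printf 'Hand und Land\nStrand Band\n' > sample.txt
count_clusters sample.txt
```

With this sample the single consonants (H, L, B) should drop out, leaving only the ND and STR lines with their counts.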