Tags: linux, bash, macos, grep, wc

Removing uppercase/lowercase dupes in list


I put together a small script that is supposed to search through files of a certain type in a directory and count the occurrences of each unique word of 4 or more characters, but it's not working as expected.

  1. It doesn't treat the same word in different cases (e.g. char and CHAR) as one word.
  2. I'm not sure how to tally up the total for each word across all files.
  3. Lastly, is this an efficient way to do it (assuming it actually worked)?

Script:

#!/bin/bash

file_list=()
while IFS= read file ; do
    file_list=("${file_list[@]}" "$file")
    tr -sc 'A-Za-z' '\012' < "$file" | sort | uniq -c | egrep "\w{4,}" >> words.txt
done < <(find . -maxdepth 1 -type f -name "*.c")

# echo "${file_list[@]}"

cat words.txt | sort -u | sort -nr 
echo "" > words.txt

Example output:

  38 char
  35 return
  25 static
  18 year
  18 char
  10 COLS
  10 CHAR

How would I remove the duplicated word char in the example above while still getting its total count across all files?


Solution

  • First, convert to all-lowercase as the first step in your pipeline.

    tr A-Z a-z <"$file" | tr -sc a-z '\012' | ...
    

  • Second, do the sorting and counting once, at the end of the whole thing, instead of inside the loop, so that each word's count is totalled across all files:

    ...
      tr A-Z a-z <"$file" | tr -sc a-z '\012' 
    done < <(find ...) | sort | uniq -c | egrep "\w{4,}" >words.txt
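
Putting both steps together, a minimal sketch of the complete pipeline might look like the following. The function name count_words is made up for illustration, and the sketch assumes it is acceptable to concatenate the files with find -exec cat instead of looping over them. It also filters on word length before counting rather than after: \w matches digits too, so a filter placed after uniq -c would let through any line whose count has four or more digits, whatever the word length.

```bash
#!/bin/bash

# count_words DIR: print "count word" for every word of 4+ letters found in
# the *.c files directly inside DIR, ignoring case, most frequent first.
# (count_words is a hypothetical helper name, not from the original post.)
count_words() {
    local dir=${1:-.}
    find "$dir" -maxdepth 1 -type f -name '*.c' -exec cat {} + |
        tr 'A-Z' 'a-z' |           # fold case first so CHAR and char match
        tr -sc 'a-z' '\n' |        # replace every non-letter run with one newline
        grep -E '^[a-z]{4,}$' |    # keep only words of 4 or more letters
        sort | uniq -c |           # count once, across all files
        sort -k1,1nr -k2,2         # highest count first, ties alphabetical
}
```

Calling count_words . from the directory that holds the .c files prints the merged list, and no temporary words.txt is needed.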