I put together a small script that is supposed to search through files of a certain type in a directory and accumulate counts of the unique words that are 4 or more characters long, but it's not working as expected.
Script:
#!/bin/bash
file_list=()
while IFS= read file ; do
file_list=("${file_list[@]}" "$file")
tr -sc 'A-Za-z' '\012' < "$file" | sort | uniq -c | egrep "\w{4,}" >> words.txt
done < <(find . -maxdepth 1 -type f -name "*.c")
# echo "${file_list[@]}"
cat words.txt | sort -u | sort -nr
echo "" > words.txt
Example output:
38 char
35 return
25 static
18 year
18 char
10 COLS
10 CHAR
How would I remove the duplicated word char in the example above, while still getting its total count across all files?
First, convert everything to lowercase at the start of your pipeline, so that CHAR and char are counted as the same word.
tr A-Z a-z <"$file" | tr -sc a-z '\012' | ...
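Second, do the sorting and counting once, at the end of the whole thing, instead of inside the loop. As written, each file appends its own counts to words.txt, and sort -u only drops lines that are completely identical, so "38 char" and "18 char" both survive: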
...
tr A-Z a-z <"$file" | tr -sc a-z '\012'
done < <(find ...) | sort | uniq -c | egrep "\w{4,}" >words.txt
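Putting both suggestions together, a complete version might look roughly like the sketch below. The function name count_words and its dir and min_len parameters are made up for illustration and are not part of the original script. It also assumes GNU find (for -maxdepth) and bash (for read -d ''), and it filters on word length before counting, so that the pattern is matched against the word alone rather than against the "count word" lines that uniq -c produces.

```bash
#!/bin/bash
# Hypothetical helper (name and parameters are illustrative only):
# count words of at least min_len letters across all *.c files in dir.
count_words() {
    local dir=${1:-.} min_len=${2:-4}

    find "$dir" -maxdepth 1 -type f -name '*.c' -print0 |
        while IFS= read -r -d '' file; do
            # Lowercase first, then turn every run of non-letters into a newline,
            # which leaves one word per line.
            tr 'A-Z' 'a-z' < "$file" | tr -sc 'a-z' '\n'
        done |
        # Keep only whole words that are long enough (this also drops empty lines).
        grep -E "^[a-z]{${min_len},}\$" |
        # Count once over the words from all files, then order by frequency.
        sort | uniq -c | sort -nr
}
```

Calling count_words . 4 prints the combined counts for the current directory. Because nothing is appended to words.txt inside the loop, there is no intermediate file to reset afterwards; redirect the output yourself if you want to keep it.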