I need to find strings (acronyms) that repeat in a specific text file.
Here is a sample:
...
the
the
het
het
het
teh
teh
teh
teh
...
As a first step, I can count how many times each string appears with this command:
sort text_file.txt | uniq -c | sort -nr
And the output is something like this:
4 teh
3 het
2 the
But I also need to sum these three counts, because the three strings use the same three characters in a different order.
Can anyone help me with this?
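With GNU awk, which splits a string into characters when given a null separator and lets you control array traversal order through PROCINFO["sorted_in"], you can sort the characters of each line into a canonical key and count the keys: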
$ cat tst.awk
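{
    # Split the line into single characters (null separator is a GNU awk extension)
    split($0, chars, "")
    # Visit the characters in ascending order of value to build a canonical key
    PROCINFO["sorted_in"] = "@val_str_asc"
    key = ""
    for (i in chars) {
        key = key chars[i]
    }
    cnt[key]++
}
END {
    # Print the keys in ascending string order
    PROCINFO["sorted_in"] = "@ind_str_asc"
    for (key in cnt) {
        print key, cnt[key]
    }
}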
$ cat file
the
het
teh
foobar
fobar
oofrab
$ awk -f tst.awk file
abfoor 2
abfor 1
eht 3
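If GNU awk is not available, the same idea (sort the characters of each line into a key, then count identical keys) can be sketched with standard tools. This is only a rough sketch: anagram_counts is a hypothetical helper name, and it assumes one word per line and single-byte characters. It spawns a few processes per line, so it will be much slower than the awk script on large files.

```shell
# Hypothetical helper: count lines that are anagrams of each other.
# Assumes one word per line and single-byte (ASCII) characters.
# usage: anagram_counts text_file.txt
anagram_counts() {
    while IFS= read -r word; do
        # One character per line, sort them, then join them back into a key
        printf '%s\n' "$word" | fold -w1 | LC_ALL=C sort | tr -d '\n'
        printf '\n'
    done < "$1" | LC_ALL=C sort | uniq -c
}
```

The output has the uniq -c layout, with the count first and the sorted-character key second.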