Search code examples
bashubuntucommand-linepalindrome

Counting palindromes in a text file


Having followed this thread BASH Finding palindromes in a .txt file I can't figure out what am I doing wrong with my script.

#!/bin/bash
search() {
tr -d '[[:punct:][:digit:]@]' \
| sed -E -e '/^(.)\1+$/d'      \
| tr -s '[[:space:]]'           \
| tr '[[:space:]]' '\n'
}

search "$1"

paste <(search <"$1") <(search < "$1" | rev)     \
| awk '$1 == $2 && (length($1) >=3) { print $1 }' \
| sort | uniq -c

All im getting from this script is output of the whole text file. I want to only output palindromes >=3 and count them such as

425 did

120 non

etc. My textfile is called sample.txt and everytime i run the script with: cat sample.txt | source palindrome I get message 'bash: : No such file or directory'.


Solution

  • Using awk and sed

    awk 'function palindrome(str) {len=length(str); for(k=1; k<=len/2+len%2; k++) { if(substr(str,k,1)!=substr(str,len+1-k,1)) return 0 } return 1 } {for(i=1; i<=NF; i++) {if(length($i)>=3){ gsub(/[^a-zA-Z]/,"",$i); if(length($i)>=3) {$i=tolower($i); if(palindrome($i)) arr[$i]++ }} } } END{for(i in arr) print arr[i],i}' file | sed -E '/^[0-9]+ (.)\1+$/d'
    

    Tested on 1.2GB file and execution time was ~4m 40s (i5-6440HQ @ 2.60GHz/4 cores/16GB)

    Explanation :

    awk '
        function palindrome(str)               # Function to check Palindrome
        {
            len=length(str); 
            for(k=1; k<=len/2+len%2; k++) 
            { 
                if(substr(str,k,1)!=substr(str,len+1-k,1)) 
                return 0 
            } 
            return 1 
        } 
    
        {
            for(i=1; i<=NF; i++)               # For Each field in a record
            {
                if(length($i)>=3)              # if length>=3
                { 
                    gsub(/[^a-zA-Z]/,"",$i);   # remove non-alpha character from it
                    if(length($i)>=3)          # Check length again after removal
                    {
                        $i=tolower($i);        # Covert to lowercase
                        if(palindrome($i))     # Check if it's palindrome
                            arr[$i]++          # and store it in array
                    }
                }
            } 
        } 
    
        END{for(i in arr) print arr[i],i}' file | sed -E '/^[0-9]+ (.)\1+$/d' 
    

    sed -E '/^[0-9]+ (.)\1+$/d' : From the final result check which strings are composed of just repeated chracters like AAA, BBB etc and remove them.


    Old Answer (Before EDIT)

    You can try below steps if you want to :

    Step 1 : Pre-processing
    Remove all unnecessary chars and store the result in temp file

    tr -dc 'a-zA-Z\n\t ' <file | tr ' ' '\n' > temp
    

    tr -dc 'a-zA-Z\n\t ' This will remove all except letters,\n,\t, space

    tr ' ' '\n' This will convert space to \n to separate each word in newlines

    Step-2: Processing

    grep -wof temp <(rev temp)  | sed -E -e '/^(.)\1+$/d' | awk 'length>=3 {a[$1]++} END{ for(i in a) print a[i],i; }'
    

    grep -wof temp <(rev temp) This will give you all palindromes
    -w : Select only those lines containing matches that form whole words. For example : level won't match with levelAAA
    -o : Print only the matched group
    -f : To use each string in temp file as pattern to search in <(rev temp)

    sed -E -e '/^(.)\1+$/d': This will remove words formed of same letters like AAA, BBBBB

    awk 'length>=3 {a[$1]++} END{ for(i in a) print a[i],i; }' : This will filter words having length>=3 and counts their frequency and finally prints the result

    Example :

    Input File :

    $ cat file
    kayak nalayak bob dad , pikachu. meow !! bhow !! 121 545 ding dong AAA BBB done
    kayak nalayak bob dad , pikachu. meow !! bhow !! 121 545 ding dong AAA BBB done
    kayak nalayak bob dad , pikachu. meow !! bhow !! 121 545 ding dong AAA BBB done
    

    Output:

    $ tr -dc 'a-zA-Z\n\t ' <file | tr ' ' '\n' > temp
    $ grep -wof temp <(rev temp)  | sed -E -e '/^(.)\1+$/d' | awk 'length>=3 {a[$1]++} END{ for(i in a) print a[i],i; }' 
    3 dad
    3 kayak
    3 bob