
Comparing two arrays by first letters of each element and printing if equal


Let's say I have two arrays (A and B)

in directory A

#!/bin/bash
A=( $(ls *txt) )

directory A contains:

fox_abdce.txt
rabbit_abdce.txt
lemom_asnrndna.txt

in directory B

#!/bin/bash
B=( $(ls *txt) )

directory B contains:

fox_zzzzzz.txt
rabbit_zzzedd.txt
lemom_kokoijijim.txt

Or, with the arrays declared explicitly (this could be generalized to any similar input):

#!/bin/bash
declare -a A=([0]="fox_abdce.txt" [1]="lemom_asnrndna.txt" [2]="rabbit_abdce.txt")
declare -a B=([0]="fox_zzzzzz.txt" [1]="lemom_kokoijijim.txt" [2]="rabbit_zzzedd.txt")

I want to compare them element by element and find out whether the first 3 letters of each pair are the same.

I would use AWK like this to find out if two columns from a csv file have the same initial three letters:

#!/bin/bash
export NUMBER_OF_DIGITS=3

Matching:

awk -F, '{if(substr($1, 1, $NUMBER_OF_DIGITS) == substr($2, 1, $NUMBER_OF_DIGITS)) print}' file.csv

Not matching:

awk -F, '{if(substr($1, 1, $NUMBER_OF_DIGITS) != substr($2, 1, $NUMBER_OF_DIGITS)) print}' file.csv

How could I apply the same check, but using the arrays directly?

In this case every pair matches, so the output should list everything, in either of these layouts:

fox_abdce.txt
rabbit_abdce.txt
lemom_asnrndna.txt

fox_zzzzzz.txt
rabbit_zzzedd.txt
lemom_kokoijijim.txt

OR

fox_abdce.txt             fox_zzzzzz.txt
rabbit_abdce.txt          rabbit_zzzedd.txt
lemom_asnrndna.txt        lemom_kokoijijim.txt

Solution

  • Assumptions:

    • file names do not include embedded linefeeds
    • both arrays have the same number of entries
    • we're to compare array entries that have the same array index

    Adding a 'not matching' data point:

    A=("fox_abdce.txt" "rabbit_abdce.txt" "ignore_me" "lemom_asnrndna.txt")
    B=("fox_zzzzzz.txt" "rabbit_zzzedd.txt" "not_me" "lemom_kokoijijim.txt")
    

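    Fixing the NUMBER_OF_DIGITS issue (the parentheses create a one-element array where a plain scalar is all that is needed, and the value has to be handed to awk explicitly because the shell does not expand variables inside a single-quoted awk script):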

    #### replace this:
    
    NUMBER_OF_DIGITS=(3)
    
    #### with this:
    
    NUMBER_OF_DIGITS=3
    
    #### then feed to awk via a -v flag/arg, eg:
    
    awk -v awk_var_name="OS_var_value"
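
To illustrate the hand-off, here is a minimal sketch. The sample input line and the awk variable name len are chosen only for this illustration. Because the question exports NUMBER_OF_DIGITS, awk could alternatively read it from its ENVIRON array, which the second command shows.

```shell
#!/bin/bash
export NUMBER_OF_DIGITS=3

# Option 1: copy the shell value into an awk variable (named len here) with -v.
via_v=$(printf '%s\n' 'fox_abdce.txt,fox_zzzzzz.txt' |
        awk -F, -v len="${NUMBER_OF_DIGITS}" '{ print substr($1, 1, len) }')

# Option 2: read the exported variable from awk's ENVIRON array.
via_environ=$(printf '%s\n' 'fox_abdce.txt,fox_zzzzzz.txt' |
        awk -F, '{ print substr($1, 1, ENVIRON["NUMBER_OF_DIGITS"]) }')

echo "${via_v} ${via_environ}"
```

Both commands extract the same three-character prefix. Option 2 only works when the variable is exported, whereas -v works for any shell value.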
    

    One awk idea using process substitution:

    echo "########## matching"
    
    awk -v len="${NUMBER_OF_DIGITS}" '
    FNR==NR                                  { a[FNR]=$0; next }
    substr(a[FNR],1,len) == substr($0,1,len) { print a[FNR],$0 }
    ' <(printf "%s\n" "${A[@]}") <(printf "%s\n" "${B[@]}")
    
    echo "########## not matching"
    
    awk -v len="${NUMBER_OF_DIGITS}" '
    FNR==NR                                  { a[FNR]=$0; next }
    substr(a[FNR],1,len) != substr($0,1,len) { print a[FNR],$0 }
    ' <(printf "%s\n" "${A[@]}") <(printf "%s\n" "${B[@]}")
    

    This generates:

    ########## matching
    fox_abdce.txt fox_zzzzzz.txt
    rabbit_abdce.txt rabbit_zzzedd.txt
    lemom_asnrndna.txt lemom_kokoijijim.txt
    
    ########## not matching
    ignore_me not_me
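
For comparison, the same index-by-index test can be sketched in plain bash with substring expansion, without calling awk. This is an alternative to the awk idea above, not a replacement for it. The function name compare_prefix and the 25-character column width are arbitrary choices for this sketch, and it relies on the same assumptions (equal-length arrays, compared by index).

```shell
#!/bin/bash
A=("fox_abdce.txt" "rabbit_abdce.txt" "ignore_me" "lemom_asnrndna.txt")
B=("fox_zzzzzz.txt" "rabbit_zzzedd.txt" "not_me" "lemom_kokoijijim.txt")
NUMBER_OF_DIGITS=3

# compare_prefix MODE: print the pairs whose first NUMBER_OF_DIGITS characters
# match (MODE=match) or differ (any other MODE), as two aligned columns.
compare_prefix() {
    local mode=$1 i
    for i in "${!A[@]}"; do
        if [[ "${A[i]:0:NUMBER_OF_DIGITS}" == "${B[i]:0:NUMBER_OF_DIGITS}" ]]; then
            [[ $mode == match ]] && printf '%-25s %s\n' "${A[i]}" "${B[i]}"
        else
            [[ $mode == match ]] || printf '%-25s %s\n' "${A[i]}" "${B[i]}"
        fi
    done
    return 0
}

echo "########## matching"
compare_prefix match

echo "########## not matching"
compare_prefix nomatch
```

The printf format pads the first name to a fixed width, which gives the side-by-side layout shown in the question.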
    

    Additional assumption for the paste-based approach below:

    • file names do not include embedded commas (otherwise we will need to choose a different delimiter for the paste command)

    A different approach, using paste to join the output of the two process substitutions:

    $ paste -d, <(printf "%s\n" "${A[@]}") <(printf "%s\n" "${B[@]}")
    fox_abdce.txt,fox_zzzzzz.txt
    rabbit_abdce.txt,rabbit_zzzedd.txt
    ignore_me,not_me
    lemom_asnrndna.txt,lemom_kokoijijim.txt
    

    Feeding the paste output to awk:

    echo "########## matching"
    
    awk -F, -v len="${NUMBER_OF_DIGITS}" '
    substr($1,1,len) == substr($2,1,len)
    ' <(paste -d, <(printf "%s\n" "${A[@]}") <(printf "%s\n" "${B[@]}"))
    
    echo "########## not matching"
    
    awk -F, -v len="${NUMBER_OF_DIGITS}" '
    substr($1,1,len) != substr($2,1,len)
    ' <(paste -d, <(printf "%s\n" "${A[@]}") <(printf "%s\n" "${B[@]}"))
    

    This generates:

    ########## matching
    fox_abdce.txt,fox_zzzzzz.txt
    rabbit_abdce.txt,rabbit_zzzedd.txt
    lemom_asnrndna.txt,lemom_kokoijijim.txt
    
    ########## not matching
    ignore_me,not_me
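
If commas may appear in the file names, one option (a sketch, assuming tabs do not appear in them) is to rely on paste's default tab delimiter and set awk's field separator to a tab. The variant below also classifies every pair in a single awk pass instead of running awk twice. The function name classify_pairs and the labels match and differ are made up for this illustration.

```shell
#!/bin/bash
A=("fox_abdce.txt" "rabbit_abdce.txt" "ignore_me" "lemom_asnrndna.txt")
B=("fox_zzzzzz.txt" "rabbit_zzzedd.txt" "not_me" "lemom_kokoijijim.txt")
NUMBER_OF_DIGITS=3

# paste joins the two lists with a tab by default; awk splits on that tab
# and prefixes each pair with a label describing the comparison result.
classify_pairs() {
    paste <(printf "%s\n" "${A[@]}") <(printf "%s\n" "${B[@]}") |
    awk -F'\t' -v OFS='\t' -v len="${NUMBER_OF_DIGITS}" '
    {
        label = (substr($1,1,len) == substr($2,1,len)) ? "match" : "differ"
        print label, $1, $2
    }'
}

classify_pairs
```

Each output line is tab-separated, so the label can be filtered afterwards, for example with grep '^match'.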