Search code examples
linuxbashubuntuunixgnu

How to get the most counted value of a CSV column in BASH?


Good afternoon,

I would like to know how to get the most value frequented in columna B and stored into variable, in case of tie, getting the lowest value from column A according to the column B. Theses columns come from CSV delimited by ',' The column B show us always in capital letter.

Doesn't allow grep, awk, sed, csvkit

file.csv->

A,B
4,AA
3,AA
2,BB
1,BB

I tried:

var=$(tail +2 file.csv | cut -d , -f2 | sort | uniq -c)
echo $var
2 AA 2 BB

Unfortunately, I was expecting something like this (without showing the most string repeated and showing the correct string in case of tie; that's because BB has the lowest value in columna A with the number 1 in comparison to the value 3 of the AA):

echo $var
BB

Solution

  • #!/bin/bash
    
    declare -A count min
    declare -i max_count=0 max_value
    
    # loop 1: get the count and the minimum value for each B
    # and find the max value for A
    {
        read header
        while IFS=, read -r a b; do
            ((count[$b]++))
            ((count[$b] > max_count)) && max_count=${count[$b]}
            if [[ ! -v min[$b] ]] || ((a < min[$b])); then min[$b]=$a; fi
            if [[ -z $max_value ]] || ((max_value < a)); then max_value=$a; fi
        done
    } < file.csv
    
    # loop 2: find the B's with the max count
    declare -a candidates
    for c in "${!count[@]}"; do
        if ((count[$c] == max_count)); then
            candidates+=("$c")
        fi
    done
    
    # loop 3: find the minimum A value among the candidates
    min_value=$max_value
    for c in "${candidates[@]}"; do
        ((min[$c] < min_value)) && min_value=${min[$c]}
    done
    
    # loop 4: print out the candidates with the minimum A value
    for c in "${candidates[@]}"; do
        ((min[$c] == min_value)) && echo "$c"
    done