
Reading Column and Find Median (Bash)


I want to find the median of each column, but my code doesn't work the way I want. Given this input:

1 2 3 
3 2 1
2 1 5

I'm expecting

2 2 3

as the result, but instead I get an error and some kind of "sum" of the column. Below is a snippet of my code for the column medians:

while read -r line; do
    read -a array <<< "$line"
    for i in "${!array[@]}"
    do
      column[${i}]=${array[$i]}
      ((length[${i}]++))
      result=${column[*]} | sort -n
    done < file
 for i in ${!column[@]}
 do
   #some median calculation.....

Note: I want to practice Bash, which is why I'm doing this in pure Bash. I'd really appreciate any help, especially with a Bash solution. Thank you.


Solution

  • Bash is really not suitable for low-level text processing like this: the read command does a system call for each character that it reads, which makes it slow and a CPU hog. It's OK for processing interactive input, but using it for general text processing is madness. It would be much better to use awk (or Python, Perl, etc.) for this.

    As an exercise in learning about Bash I guess it's ok, but please try to avoid using read for bulk text processing in real programs. For further information, please see Why is using a shell loop to process text considered bad practice? on the Unix & Linux Stack Exchange site, especially the answer written by Stéphane Chazelas (the discoverer of the Shellshock Bash bug).
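
For comparison, here is a rough sketch of how the same job might look when awk does the work. The function name column_medians is made up for this example, and the sketch assumes every row has the same number of numeric fields. It uses a small insertion sort so that it does not depend on gawk-only extensions, and for an even number of rows it prints the lower of the two middle values.

```shell
#!/usr/bin/env bash
# column_medians is a hypothetical helper name, not part of the original answer.
# It reads whitespace-separated numeric rows from the named files (or stdin)
# and prints the lower median of each column on one line.
column_medians() {
    awk '
    {
        # Remember every field, indexed by (column, row).
        for (c = 1; c <= NF; c++) val[c, NR] = $c
        if (NF > ncols) ncols = NF
    }
    END {
        if (NR == 0) exit
        for (c = 1; c <= ncols; c++) {
            # Copy the column into a 1-based array, forcing numeric values.
            for (r = 1; r <= NR; r++) a[r] = val[c, r] + 0
            # Insertion sort, ascending.
            for (i = 2; i <= NR; i++) {
                v = a[i]
                for (j = i - 1; j >= 1 && a[j] > v; j--) a[j + 1] = a[j]
                a[j + 1] = v
            }
            # Middle element; the lower of the two when NR is even.
            printf "%s%s", a[int((NR + 1) / 2)], (c < ncols ? " " : "\n")
        }
    }' "$@"
}
```

Called as column_medians file on the sample data from the question, this should print 2 2 3.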

    Anyway, to get back to your question... :)

    Most of your code is ok, but

    result=${column[*]} | sort -n
    

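    doesn't do what you want it to. The assignment runs as the first command of a pipeline, so it happens in a subshell and writes nothing for sort to read; result is never set in your script. To capture sorted output in a variable you need command substitution, $(...).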
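
A minimal sketch of the difference between the pipeline form and a command-substitution form, using a made-up three-element array in place of the question's column data:

```shell
#!/usr/bin/env bash
# Sketch only: the sample values are invented for illustration.
column=(3 1 2)

# Form used in the question: the assignment is the first stage of a pipeline.
# It runs in a subshell and prints nothing, so sort reads empty input and
# the variable is not set in this shell.
unset result
result=${column[*]} | sort -n
echo "after pipeline: '${result-}'"

# One possible fix: print one value per line, sort, and capture the output.
result=$(printf '%s\n' "${column[@]}" | sort -n)
echo "after command substitution:"
echo "$result"
```

The first echo shows an empty string; the second shows the values 1, 2 and 3 on separate lines.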

    Here's one way to get the column medians in pure Bash:

    #!/usr/bin/env bash
    
    # Find medians of columns of numeric data
    # See http://stackoverflow.com/q/33095764/4014959
    # Written by PM 2Ring 2015.10.13
    
    fname=$1
    echo "input data:"
    cat "$fname"
    echo
    
    #Read rows, saving into columns
    numrows=1
    while read -r -a array; do
        ((numrows++))
        for i in "${!array[@]}"; do
            #Separate column items with a newline
            column[i]+="${array[i]}"$'\n'
        done
    done < "$fname"
    
    #Calculate line number of middle value; which must be 1-based to use as `head`
    #argument, and must compensate for extra newline added by 'here' string, `<<<`
    midrow=$((1+numrows/2))
    echo "midrow: $midrow"
    
    #Get median of each column
    result=''
    for i in "${!column[@]}"; do
        median=$(sort -n <<<"${column[i]}" | head -n "$midrow" | tail -n 1)
        result+="$median "
    done
    
    echo "result: $result" 
    

    output

    input data:
    1 2 3
    3 2 1
    2 1 5
    
    midrow: 3
    result: 2 2 3
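
One caveat about the script above: each column string ends in a newline and the here-string adds another, so sort sees one blank line, and midrow is chosen on the assumption that this blank line sorts first. That holds for the sample data, but sort -n treats a blank line as zero, so with negative values the blank line no longer lands at the top and the selected row shifts. A possible variant that avoids the blank line is sketched below. The helper name median_of_lines is hypothetical, mapfile needs Bash 4 or later, and the input is assumed to contain one number per line with no blank lines.

```shell
#!/usr/bin/env bash
# median_of_lines is a hypothetical helper, not part of the original answer.
# It reads one number per line on stdin and prints the lower median.
median_of_lines() {
    local data mid
    local -a sorted
    # Command substitution strips trailing newlines, so no blank entry is kept.
    data=$(sort -n) || return 1
    [[ -n $data ]] || return 1
    # The here-string adds back exactly one newline; -t strips it per element.
    mapfile -t sorted <<< "$data"
    # Zero-based index of the middle element (the lower one for even counts).
    mid=$(( (${#sorted[@]} - 1) / 2 ))
    printf '%s\n' "${sorted[mid]}"
}

# Example with negative values:
printf '%s\n' -5 -1 -3 | median_of_lines
```

In the script above, the sort, head and tail pipeline could then be replaced with median=$(printf '%s' "${column[i]}" | median_of_lines), which makes midrow unnecessary.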