
Adding a column to the right of a table with an unknown number of columns


I have two tables (1.csv and 2.csv). The first one consists of names and scores, like...

name,score
Alice,6
Bob,1
Joe,2
John,7
Michel,9

The other one has a varying number of columns per row (sometimes only one column, sometimes two, sometimes three...), like...

Alice, Bob,
Michel,
Joe,John

I would like to insert a score column immediately to the right of each name, holding that name's score. So the desired result is:

Alice,6,Bob,1
Michel,9
Joe,2,John,7

How can I implement this? I would prefer an awk/bash script for this task. Thanks in advance.

I tried running the following script:

#!/bin/bash

declare -A scores
while IFS=',' read -r name score
do
    scores["$name"]=$score
done < 1.csv

while IFS=',' read -r -a names
do
    for name in "${names[@]}"
    do
        name=$(echo "$name" | xargs)
        echo -n "$name,${scores[$name]},"
    done
    echo
done < 2.csv

I was expecting a result like,

Alice,6,Bob,1
Michel,9
Joe,2,John,7

It seemed OK in this MWE, but for bigger tables it no longer works. How should I improve it?


Solution

  • "it no longer works" tells us nothing about the problem, so it's hard to help given that and no sample input/output with which we could reproduce the failure. Do divide-and-conquer on the large input that has the problem until you end up with a minimal version that still exhibits it, then post the input/output from that as your minimal reproducible example.

    Having said that, if you want to do this using bash and mandatory POSIX tools (i.e. no python, perl, ruby, etc.), then you wouldn't use a bash while-read loop as in your script at all; see why-is-using-a-shell-loop-to-process-text-considered-bad-practice. The bash part would just call a single awk script to do the rest. See whats-the-most-robust-way-to-efficiently-parse-csv-using-awk for how to parse CSV with awk in general, but the following may be adequate for your needs given the sample input you provided, using any POSIX awk:

    $ cat tst.sh
    #!/usr/bin/env bash
    
    awk '
        BEGIN { FS = OFS = "," }
        {
            sub(/\r$/,"")
            for ( i=1; i<=NF; i++ ) {
                gsub(/[[:space:]]+/," ",$i)
                gsub(/^ | $/,"",$i)
            }
        }
        NR == FNR {
            if ( NR > 1 ) {
                name2score[$1] = $2
            }
            next
        }
        {
            sub(FS"$","")
            for ( i=1; i<=NF; i++ ) {
                printf "%s%s%d%s", $i, OFS, name2score[$i]+0, (i<NF ? OFS : ORS)
            }
        }
    ' 1.csv 2.csv
    

    $ ./tst.sh
    Alice,6,Bob,1
    Michel,9
    Joe,2,John,7
    
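If it helps to verify the result locally, here is a sketch that recreates the sample inputs in a temporary directory (the temp-dir setup is just for the demo, not part of the answer) and runs the same awk program inline:

```shell
# Demo setup only: recreate the sample inputs in a temp directory.
tmp=$(mktemp -d)
printf 'name,score\nAlice,6\nBob,1\nJoe,2\nJohn,7\nMichel,9\n' > "$tmp/1.csv"
printf 'Alice, Bob,\nMichel,\nJoe,John\n' > "$tmp/2.csv"

# Same awk program as above: first pass builds the name->score map,
# second pass trims fields, drops a trailing empty field, and prints
# each name followed by its score.
result=$(awk '
    BEGIN { FS = OFS = "," }
    {
        sub(/\r$/,"")                       # strip any DOS line ending
        for ( i=1; i<=NF; i++ ) {
            gsub(/[[:space:]]+/," ",$i)     # squeeze whitespace runs
            gsub(/^ | $/,"",$i)             # trim leading/trailing blanks
        }
    }
    NR == FNR { if ( NR > 1 ) name2score[$1] = $2; next }
    {
        sub(FS"$","")                       # drop a trailing empty field
        for ( i=1; i<=NF; i++ )
            printf "%s%s%d%s", $i, OFS, name2score[$i]+0, (i<NF ? OFS : ORS)
    }
' "$tmp/1.csv" "$tmp/2.csv")

printf '%s\n' "$result"
rm -r "$tmp"
```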

    If you don't have a POSIX awk then change this:

            gsub(/[[:space:]]+/," ",$i)
            gsub(/^ | $/,"",$i)
    

    to this:

            gsub(/[ \t]+/," ",$i)
            sub(/^ /,"",$i); sub(/ $/,"",$i)
    

    and then it'll work in any awk, provided the only mid-line whitespace characters in your input are blanks and tabs, which is probably the case.
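For completeness, here is a sketch of the full program with that portable whitespace handling swapped in, runnable end to end against the same sample data (the file creation is demo setup only):

```shell
# Demo setup only: recreate the sample inputs in a temp directory.
tmp=$(mktemp -d)
printf 'name,score\nAlice,6\nBob,1\nJoe,2\nJohn,7\nMichel,9\n' > "$tmp/1.csv"
printf 'Alice, Bob,\nMichel,\nJoe,John\n' > "$tmp/2.csv"

# Fallback version for non-POSIX awks: [ \t]+ instead of [[:space:]]+,
# and two separate sub() calls instead of an anchored alternation.
result=$(awk '
    BEGIN { FS = OFS = "," }
    {
        sub(/\r$/,"")
        for ( i=1; i<=NF; i++ ) {
            gsub(/[ \t]+/," ",$i)             # squeeze blanks/tabs only
            sub(/^ /,"",$i); sub(/ $/,"",$i)  # trim the (single) edge blank
        }
    }
    NR == FNR { if ( NR > 1 ) name2score[$1] = $2; next }
    {
        sub(FS"$","")
        for ( i=1; i<=NF; i++ )
            printf "%s%s%d%s", $i, OFS, name2score[$i]+0, (i<NF ? OFS : ORS)
    }
' "$tmp/1.csv" "$tmp/2.csv")

printf '%s\n' "$result"
rm -r "$tmp"
```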